A parallel and incremental algorithm for efficient unique signature discovery on DNA databases

Background

DNA signatures are distinct short nucleotide sequences that provide valuable information for purposes such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures in DNA databases and then apply the signatures to microarray experiments. Such discovery algorithms require input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, guidance on how to select proper factor values is rare, especially when an unfamiliar DNA database is used. In most cases, biologists select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, they change the input factors of the algorithm to obtain a new result, and repeat this process until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, so that those meeting the requirements can be picked from the results, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures.

Results

This work proposes two discovery algorithms: the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed to discover signatures efficiently under a given discovery condition. It finds new results by using previously discovered results as candidates, rather than by using the whole database, and it further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently. It uses the PISD algorithm as a kernel routine to discover implicit signatures under every feasible discovery condition.

Conclusions

The proposed algorithms discover implicit signatures efficiently. In experiments using eight processing cores, the CMD algorithm required up to 97% less execution time than typical sequential discovery algorithms to discover implicit signatures.
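
The idea of reusing previously discovered results as candidates can be illustrated with a small brute-force sketch. It is not the PISD or CMD implementation, and it rests on an assumption the abstract does not state: a length-l pattern is taken to be a unique signature under (l, d) when it occurs exactly once in the database and every other length-l pattern differs from it in more than d positions (Hamming distance). Under that assumed definition, the signatures for tolerance d + 1 form a subset of those for tolerance d, so each round only needs to re-check the survivors of the previous one. The function names (hamming, count_patterns, discover, discover_incremental) are hypothetical, and parallelism is omitted.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_patterns(db, l):
    """Count every length-l substring over all sequences in the database."""
    counts = Counter()
    for seq in db:
        for j in range(len(seq) - l + 1):
            counts[seq[j:j + l]] += 1
    return counts


def discover(db, l, d, candidates=None):
    """Brute-force discovery under the ASSUMED definition: a pattern is a
    signature if it occurs once and all other patterns differ in > d positions.
    If candidates is given, only those patterns are checked."""
    counts = count_patterns(db, l)
    if candidates is None:
        candidates = list(counts)  # no earlier result: check every pattern
    result = []
    for p in candidates:
        if counts[p] != 1:
            continue
        if all(hamming(p, q) > d for q in counts if q != p):
            result.append(p)
    return result


def discover_incremental(db, l, d_max):
    """Discover signatures for d = 0..d_max, feeding each result forward
    as the candidate set for the next, stricter tolerance."""
    results = {}
    candidates = None
    for d in range(d_max + 1):
        candidates = discover(db, l, d, candidates)
        results[d] = candidates
    return results


if __name__ == "__main__":
    toy_db = ["ACGTACGT", "TTTTACGA"]  # made-up toy data, not from the paper
    for d, sigs in discover_incremental(toy_db, 4, 3).items():
        print(d, sigs)
```

Each candidate is checked independently of the others, so the inner loop is the natural place to distribute work across processing cores; how the PISD algorithm actually partitions the work is not described in the abstract.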
