DNA Sequence Recognition by Hybridization to Short Oligomers

A format 1 technology for performing massive hybridization experiments has been developed as part of the sequencing by hybridization (SBH) project. Arrays of tens of thousands of clones are interrogated with short oligomer probes to determine the sets of oligomers that are present in individual clones. SBH requires highly discriminative hybridizations with a large number of probes. One of the main uses of a reconstructed DNA sequence is in a similarity search against databases of known DNA. We argue that sequence reconstruction, even partial, should not be performed for this particular purpose; we provide an information-theoretic proof that the oligomer lists obtained from hybridization experiments should be used directly for similarity searches. We propose a similarity search method that takes full advantage of the subword structure of positively identified oligomers within a clone. The method tolerates errors in hybridization experiments, requires fewer probes than are necessary for sequencing, and is computationally efficient. To enable direct sequence recognition, we apply a recently developed method of sequence comparison that is based on minimal length encoding and algorithmic mutual information. The method has been tested on both real and simulated data and has led to a correct identification of clones based on hybridizations with 109 short oligomer probes. The method is applicable to hybridization data that come from both format 1 and format 2 (sequencing chip) hybridization experiments. The sequence recognition method can provide targeting information for large-scale DNA sequencing by gel-based methods or by hybridization.
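The idea of searching a database directly with an oligomer list, without reconstructing the sequence, can be illustrated with a small sketch. This is not the minimal length encoding method of the paper: it swaps in a much simpler per-probe log-odds score, measured in bits, that compares how well a database sequence explains the observed hybridization pattern against a background model. The function names, the error rates `fp` and `fn`, and the toy probes and sequences are all hypothetical, and the sketch ignores the complementary strand and the subword structure that the proposed method exploits.

```python
import math

def probe_hits(sequence, probes):
    """Return the set of probes that occur as substrings of the sequence."""
    return {p for p in probes if p in sequence}

def score_bits(observed, sequence, probes, fp=0.05, fn=0.10):
    """Log-odds score (in bits) that `sequence` explains the observed probe set.

    fp and fn are assumed false-positive and false-negative hybridization
    rates; they are illustrative values, not figures from the paper.
    """
    expected = probe_hits(sequence, probes)
    # Background model: every probe is positive with the observed overall rate.
    q = len(observed) / len(probes)
    q = min(max(q, 1e-6), 1 - 1e-6)  # keep the logarithms finite
    total = 0.0
    for p in probes:
        p_pos = (1 - fn) if p in expected else fp  # P(positive | sequence)
        if p in observed:
            total += math.log2(p_pos / q)
        else:
            total += math.log2((1 - p_pos) / (1 - q))
    return total

def rank_database(observed, database, probes):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, score_bits(observed, seq, probes))
              for name, seq in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy data (hypothetical): six 3-mer probes and two database sequences.
probes = ["ACG", "CGT", "GTA", "TTT", "GGG", "CCC"]
database = {"seqA": "ACGTACGT", "seqB": "TTTGGGCCC"}
observed = {"ACG", "CGT", "GTA"}  # probes scored positive for one clone

ranking = rank_database(observed, database, probes)
for name, bits in ranking:
    print(f"{name}: {bits:.2f} bits")
```

A positive score means the database sequence describes the hybridization pattern more compactly than the background model does; a negative score means it does worse. Because each probe contributes independently, a few erroneous hybridizations lower the score without eliminating a true match.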
