Monotony of surprise and large-scale quest for unusual words

The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs, together with related associations or rules, is pursued in many settings in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in biosequences have been implicated in various facets of biological function and structure. The discovery of such patterns, particularly on a massive scale, poses interesting methodological and algorithmic problems, and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In previous studies, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
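As a rough illustration of what a score measuring "departure from expectation" can look like, the following minimal sketch compares the observed count f of a word with its expected count E under an i.i.d. (Bernoulli) model whose symbol probabilities are estimated from the sequence itself. The function names, the choice of model, and the particular score (f - E) / sqrt(E) are assumptions made here for illustration; they are not taken from the paper, which treats a broader family of scores and models.

```python
from collections import Counter
from math import prod, sqrt


def count_occurrences(text: str, word: str) -> int:
    """Count occurrences of `word` in `text`, overlaps included."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)


def expected_count(text: str, word: str) -> float:
    """Expected count of `word` under an i.i.d. model (illustrative assumption).

    Symbol probabilities are estimated as relative frequencies in `text`.
    """
    n, m = len(text), len(word)
    freq = Counter(text)
    p_word = prod(freq[c] / n for c in word)  # probability of `word` at a fixed position
    return (n - m + 1) * p_word               # number of positions times that probability


def surprise(text: str, word: str) -> float:
    """Illustrative score (f - E) / sqrt(E); one of many possible choices."""
    f = count_occurrences(text, word)
    e = expected_count(text, word)
    if e == 0:
        # The word uses a symbol absent from the text, so f is 0 as well.
        return 0.0
    return (f - e) / sqrt(e)
```

Under this model, extending a word can only lower its expected count, so if the extension occurs exactly as often as the original word, this score cannot decrease. For instance, in "abcabcabc" the words "ab" and "abc" both occur three times, and `surprise` ranks "abc" at least as high as "ab". This is the kind of local monotonicity that lets a detector report only the longest word in each group of equally frequent extensions.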
