Maximum-Scoring Segment Sets

We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in the postprocessing of sequence alignments. Our key result is a simple recursive relationship between maximum-scoring segment sets, which leads to fast algorithms for finding such sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
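To make the problem statement concrete, the following is a minimal sketch of a textbook dynamic program that computes the best total score of at most k disjoint segments in O(nk) time. It is not the recursive relationship or the fast algorithm referred to above, which the abstract does not spell out. The function name `max_segment_set_score` and the convention that the empty set scores 0 are assumptions made for illustration.

```python
from typing import Sequence


def max_segment_set_score(scores: Sequence[float], k: int) -> float:
    """Best total score of at most k disjoint segments (hypothetical helper).

    Plain O(n*k) dynamic program; the empty set is allowed and scores 0.
    """
    neg_inf = float("-inf")
    # closed[j]: best total in the prefix seen so far using at most j segments.
    closed = [0.0] * (k + 1)
    # ending[j]: best total when the j-th segment ends exactly at the current position.
    ending = [neg_inf] * (k + 1)

    for x in scores:
        new_closed = closed[:]
        for j in range(1, k + 1):
            # Either extend the j-th segment, or start it here after
            # at most j-1 segments that ended earlier.
            ending[j] = max(ending[j], closed[j - 1]) + x
            new_closed[j] = max(closed[j], ending[j])
        closed = new_closed

    return closed[k]


if __name__ == "__main__":
    example = [2, -3, 4, -1, 5, -10, 6]
    for k in range(1, 5):
        print(k, max_segment_set_score(example, k))
```

In this example the best single segment is 4, -1, 5 with score 8, and allowing a second segment adds the final 6 for a total of 14.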
