Maximum-scoring segment sets

We examine the problem of finding maximum-scoring sets of disjoint segments in a sequence of scores. The problem arises in DNA and protein segmentation and in the postprocessing of sequence alignments. Our key result is a simple recursive relationship between maximum-scoring segment sets, which leads to fast algorithms for finding such sets. We apply our methods to the identification of noncoding RNA genes in thermophiles.
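To make the problem statement concrete, the following is a minimal sketch of a textbook dynamic program that computes the best total score of at most k disjoint segments. It is not the recursive relationship referred to above, which is not spelled out here; the function name `max_segment_set_score` and the "at most k segments, empty set allowed" convention are assumptions made for illustration.

```python
def max_segment_set_score(scores, k):
    """Best total score of at most k disjoint, non-empty contiguous segments.

    Illustrative O(n * k) dynamic program (an assumption for this sketch,
    not the algorithm from the text). The empty set scores 0.
    """
    neg_inf = float("-inf")
    # closed[j]: best total with at most j segments in the prefix seen so far.
    closed = [0] * (k + 1)
    # open_[j]: best total with at most j segments, the last one ending
    # exactly at the most recently processed position.
    open_ = [neg_inf] * (k + 1)

    for x in scores:
        # Go from k down to 1 so closed[j - 1] still refers to the prefix
        # before x; this keeps consecutive segments from overlapping.
        for j in range(k, 0, -1):
            # Either extend the current j-th segment or start a new one at x.
            open_[j] = max(open_[j], closed[j - 1]) + x
            closed[j] = max(closed[j], open_[j])

    return closed[k]


if __name__ == "__main__":
    example = [2, -3, 4, -1, 5, -6, 3]
    for k in range(1, 5):
        print(k, max_segment_set_score(example, k))
```

In this sketch, increasing k can only keep or raise the optimum, since a solution with fewer segments remains admissible.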
