Efficient algorithms for locating the length-constrained heaviest segments with applications to biomolecular sequence analysis

We study two fundamental problems concerning the search for interesting regions in sequences: (i) given a sequence of n real numbers and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum; and (ii) given a sequence of n real numbers and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. We present an O(n)-time algorithm for the first problem and an O(n log L)-time algorithm for the second. The algorithms have potential applications in several areas of biomolecular sequence analysis, including locating GC-rich regions in a genomic DNA sequence, post-processing sequence alignments, annotating multiple sequence alignments, and computing length-constrained ungapped local alignments. Our preliminary tests on both simulated and real data indicate that the algorithms are very efficient and can locate useful regions, such as GC-rich ones.
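To make the two problem statements concrete, the following is a minimal Python sketch and not the algorithms of this paper. For problem (i) it uses a standard swapped-in technique, a sliding-window minimum over prefix sums kept in a monotonic deque, which also runs in O(n) time. For problem (ii) it uses a simple O(nL) scan, relying on the observation that a segment of length 2L or more can be split into two segments of length at least L, one of which has an average at least as large, so only lengths L to 2L-1 need checking. The function names, the half-open `(start, end)` index convention, and the 0/1 encoding of G/C bases in the usage note are assumptions made for illustration.

```python
from collections import deque
from itertools import accumulate


def max_sum_segment_at_most(a, U):
    """Return (best_sum, start, end) for the non-empty slice a[start:end]
    with end - start <= U and maximum sum. Runs in O(n) time."""
    if not a or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    prefix = [0, *accumulate(a)]  # prefix[j] == sum(a[:j])
    window = deque()              # candidate starts; prefix values strictly increase
    best = None
    for end in range(1, len(a) + 1):
        start = end - 1           # newest admissible start for this end
        # Drop candidates that can never beat the new start.
        while window and prefix[window[-1]] >= prefix[start]:
            window.pop()
        window.append(start)
        # Drop starts that would make the segment longer than U.
        while window[0] < end - U:
            window.popleft()
        total = prefix[end] - prefix[window[0]]
        if best is None or total > best[0]:
            best = (total, window[0], end)
    return best


def max_avg_segment_at_least(a, L):
    """Return (best_avg, start, end) for the slice a[start:end] with
    end - start >= L and maximum average. Simple O(n * L) scan."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0, *accumulate(a)]
    best = None
    for start in range(n - L + 1):
        # Only lengths L .. 2L-1 need to be examined (see the lead-in).
        for end in range(start + L, min(start + 2 * L - 1, n) + 1):
            avg = (prefix[end] - prefix[start]) / (end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best
```

As a usage illustration, encoding each G or C base as 1 and each A or T base as 0 turns the search for a GC-rich region of length at least L into a call such as `max_avg_segment_at_least([1, 0, 1, 1, 0, 0], 3)`, while `max_sum_segment_at_most([2, -5, 3, 4, -1, 2], 2)` looks for the heaviest segment of at most two elements.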