Paper information - Efficient algorithms for locating the length-constrained heaviest segments with applications to biomolecular sequence analysis


We study two fundamental problems concerning the search for interesting regions in sequences: (i) given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum and (ii) given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. We present an O(n)-time algorithm for the first problem and an O(n log L)-time algorithm for the second. The algorithms have potential applications in several areas of biomolecular sequence analysis including locating GC-rich regions in a genomic DNA sequence, post-processing sequence alignments, annotating multiple sequence alignments, and computing length-constrained ungapped local alignment. Our preliminary tests on both simulated and real data demonstrate that the algorithms are very efficient and able to locate useful (such as GC-rich) regions.
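The two problems in the abstract can be made concrete with a short Python sketch. It is not the authors' method: `max_sum_segment` uses a standard sliding-window minimum over prefix sums (a monotonic deque), which also runs in O(n) time for problem (i), and `max_avg_segment` is a simple O(nL) baseline for problem (ii), not the O(n log L) algorithm the paper presents. The baseline relies on the known observation that some optimal segment has length at most 2L - 1, since a longer segment can be split into two pieces of length at least L, one of which has an average no smaller than the whole. The function names and the half-open `(start, end)` return convention are assumptions made for illustration.

```python
from collections import deque
from itertools import accumulate


def max_sum_segment(a, U):
    """Problem (i): best (sum, start, end) over a[start:end] with 1 <= end - start <= U.

    Illustrative sketch only: prefix sums plus a monotonic deque, O(n) time.
    """
    if not a or U < 1:
        raise ValueError("need a non-empty sequence and U >= 1")
    prefix = [0] + list(accumulate(a))  # prefix[k] = sum of a[:k]
    window = deque()  # candidate start indices, prefix values strictly increasing
    best = None
    for end in range(1, len(a) + 1):
        start = end - 1  # newest admissible start index
        while window and prefix[window[-1]] >= prefix[start]:
            window.pop()  # dominated: not smaller, and it expires sooner
        window.append(start)
        while window[0] < end - U:
            window.popleft()  # segment would be longer than U
        total = prefix[end] - prefix[window[0]]
        if best is None or total > best[0]:
            best = (total, window[0], end)
    return best


def max_avg_segment(a, L):
    """Problem (ii): best (average, start, end) over a[start:end] with end - start >= L.

    Illustrative O(n * L) baseline: only lengths L .. 2L - 1 are examined.
    """
    n = len(a)
    if L < 1 or L > n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0] + list(accumulate(a))
    best = None
    for start in range(n - L + 1):
        last_end = min(n, start + 2 * L - 1)
        for end in range(start + L, last_end + 1):
            avg = (prefix[end] - prefix[start]) / (end - start)
            if best is None or avg > best[0]:
                best = (avg, start, end)
    return best
```

For example, with `a = [2, -5, 3, 4, -1, 6, -9]`, calling `max_sum_segment(a, 3)` selects `a[3:6] = [4, -1, 6]`, and `max_avg_segment(a, 2)` selects `a[2:4] = [3, 4]`. In a GC-content setting one might score G and C as 1 and A and T as 0 before calling the second function, though that encoding is likewise an assumption here.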

Yaw-Ling Lin | Tao Jiang | Kun-Mao Chao
