Linear-time algorithms for computing maximum-density sequence segments with bioinformatics applications

We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A$ of pairs $(a_i, w_i)$ for $i = 1, \ldots, n$ with $w_i > 0$, a segment $A(i,j)$ is a consecutive subsequence of $A$ starting with index $i$ and ending with index $j$. The width of $A(i,j)$ is $w(i,j) = \sum_{i \le k \le j} w_k$, and the density is $\left(\sum_{i \le k \le j} a_k\right) / w(i,j)$. The maximum-density segment problem takes $A$ and two values $L$ and $U$ as input and asks for a segment of $A$ with the largest possible density among those of width at least $L$ and at most $U$. When $U$ is unbounded, we provide a relatively simple $O(n)$-time algorithm, improving upon the $O(n \log L)$-time algorithm by Lin, Jiang and Chao. We then extend this result, providing an $O(n)$-time algorithm for the case when both $L$ and $U$ are specified.
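
To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm described in the abstract: it checks every feasible segment using prefix sums and takes quadratic time in the worst case, so it is best read as a reference baseline for small inputs. The function name `max_density_segment`, the 1-based index convention, and the use of `None` for an unbounded `U` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import accumulate


def max_density_segment(pairs, L, U=None):
    """Brute-force baseline (NOT the paper's O(n) algorithm).

    pairs: list of (a_k, w_k) with every w_k > 0.
    L, U:  width bounds; U=None means the upper bound is absent.
    Returns (density, i, j) with 1-based inclusive indices,
    or None if no segment has a width in [L, U].
    """
    n = len(pairs)
    # Prefix sums: pa[k] = a_1 + ... + a_k and pw[k] = w_1 + ... + w_k.
    pa = [0] + list(accumulate(a for a, _ in pairs))
    pw = [0] + list(accumulate(w for _, w in pairs))

    best = None
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            width = pw[j] - pw[i - 1]  # w(i, j)
            if width < L:
                continue  # too narrow, try a longer segment
            if U is not None and width > U:
                break  # widths only grow with j because all w_k > 0
            # Exact rational density avoids floating-point tie issues.
            density = Fraction(pa[j] - pa[i - 1]) / Fraction(width)
            if best is None or density > best[0]:
                best = (density, i, j)
    return best
```

For example, `max_density_segment([(1, 1), (5, 2), (2, 1), (0, 1)], L=2, U=3)` considers only segments of width 2 or 3 and selects the single-element segment at index 2, whose density is 5/2. When several segments share the maximum density, this sketch keeps the first one found.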
