A Linear-Time Algorithm for Studying Genetic Variation

The study of variation in DNA sequences, for instance within the framework of phylogeny or population genetics, is one of the most important subjects in modern genomics. We present a new linear-time algorithm for finding maximal k-regions in alignments of three sequences. It can be used to detect segments with a given degree of similarity, as well as the boundaries of distinct genomic environments such as gene clusters or haplotype blocks. A k-region is a region of the alignment for which there exists a center sequence whose Hamming distance from each of the alignment rows is at most k; determining k-regions in the general case is known to be NP-hard.
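To make the k-region definition concrete, the following is a minimal sketch that tests whether one region of an alignment admits a center sequence within Hamming distance k of every row. It is a brute-force check for small regions only, not the linear-time algorithm presented here. The function names `hamming` and `is_k_region`, and the half-open `start`/`end` column convention, are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_region(rows, start, end, k):
    """Return True if columns [start, end) of the alignment form a k-region.

    Hypothetical helper: exhaustively tries candidate center sequences.
    The running time is exponential in the region length, so this is only
    suitable for illustrating the definition on short regions.
    """
    segment = [row[start:end] for row in rows]
    columns = list(zip(*segment))
    # A character that appears in no row of a column mismatches every row
    # there, so it is enough to try the characters that do appear.
    choices = [sorted(set(column)) for column in columns]
    for candidate in product(*choices):
        if all(hamming(candidate, row) <= k for row in segment):
            return True
    return False


alignment = ["ACGT", "ACGA", "TCGT"]
print(is_k_region(alignment, 0, 4, 0))  # False: the rows are not identical
print(is_k_region(alignment, 0, 4, 1))  # True: "ACGT" is within 1 of each row
```

Restricting each column to the characters it actually contains keeps the search exact while shrinking it, but the search remains exponential in the region length.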
