Deconvolving sequence variation in mixed DNA populations

We present an original approach to identifying sequence variants in a mixed DNA population from sequence trace data. The method rests on parsimony: given a wildtype DNA sequence, the set of variations observed at each position in the sequencing data, and a complete catalog of all possible mutations, determine the smallest set of mutations from the catalog that fully explains the observed variations. The algorithmic complexity of the problem is analyzed for several classes of mutations, including block substitutions, single-range deletions, and single-range insertions. The reconstruction problem is shown to be NP-complete for single-range insertions and deletions, while polynomial-time algorithms are provided for block substitutions, single-character insertions, and single-character deletions. Once a minimum set of mutations compatible with the observed sequence is found, the relative frequencies of those mutations are recovered by solving a system of linear equations. Simulation results show the algorithm successfully deconvolving p53 mutations known to cause cancer. An extension of the algorithm is proposed as a new method of high-throughput screening for single nucleotide polymorphisms by multiplexing DNA.
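The two steps described above, parsimony search followed by frequency recovery, can be illustrated with a toy sketch. It rests on several assumptions that do not come from the paper: mutations are block substitutions written as hypothetical `(start, replacement)` pairs, the wildtype is always present in the mixture, each mutation sits on its own molecule, and peak heights are ideal and noise-free. The search is a brute-force enumeration with exponential cost, not the polynomial-time algorithm the abstract refers to, and all function names and sequences are invented for illustration.

```python
from fractions import Fraction
from itertools import combinations

def apply_substitution(wildtype, mutation):
    """Return the mutant sequence for a block substitution (start, replacement)."""
    start, replacement = mutation
    return wildtype[:start] + replacement + wildtype[start + len(replacement):]

def observed_sets(sequences):
    """Set of characters present at each position across equal-length sequences."""
    return [set(column) for column in zip(*sequences)]

def smallest_explanation(wildtype, observed, catalog):
    """Try subsets of the catalog in order of increasing size (brute force)."""
    for size in range(len(catalog) + 1):
        for subset in combinations(catalog, size):
            population = [wildtype] + [apply_substitution(wildtype, m) for m in subset]
            if observed_sets(population) == observed:
                return list(subset)
    return None  # no subset of the catalog reproduces the observations

def solve_frequencies(population, heights):
    """Solve sum(f_k for sequences with character c at position i) = heights[i][c].

    Uses exact Gauss-Jordan elimination and assumes the heights are consistent.
    """
    n = len(population)
    rows = []
    for i, column in enumerate(zip(*population)):
        for c in set(column):
            coeffs = [Fraction(int(ch == c)) for ch in column]
            rows.append(coeffs + [Fraction(heights[i].get(c, 0))])
    for col in range(n):
        pivot = next((r for r in range(col, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            raise ValueError("frequencies are not uniquely determined")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = rows[col][col]
        rows[col] = [v / scale for v in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[k][n] for k in range(n)]

# Toy data (invented): a 6-base wildtype and a 4-entry mutation catalog.
wildtype = "ACGTAC"
catalog = [(1, "T"), (3, "GG"), (4, "G"), (3, "G")]
observed = [{"A"}, {"C", "T"}, {"G"}, {"T", "G"}, {"A", "G"}, {"C"}]

chosen = smallest_explanation(wildtype, observed, catalog)
# chosen -> [(1, 'T'), (3, 'GG')]; three single-base mutations would also fit,
# but parsimony prefers the two-mutation explanation.

# Ideal relative peak heights for a 60% / 30% / 10% mixture.
heights = [
    {"A": 1},
    {"C": Fraction(7, 10), "T": Fraction(3, 10)},
    {"G": 1},
    {"T": Fraction(9, 10), "G": Fraction(1, 10)},
    {"A": Fraction(9, 10), "G": Fraction(1, 10)},
    {"C": 1},
]
population = [wildtype] + [apply_substitution(wildtype, m) for m in chosen]
frequencies = solve_frequencies(population, heights)  # wildtype first, then mutants
```

In this toy case the linear system has a unique solution of 3/5, 3/10, and 1/10 for the wildtype and the two mutants. With real trace data the peak heights are noisy, so a least-squares fit would likely be more appropriate than exact elimination.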
