maxAlike: maximum likelihood-based sequence reconstruction with application to improved primer design for unknown sequences

Motivation: Reconstructing a genomic sequence for a particular species is becoming increasingly important given the rapid development of high-throughput sequencing technologies and their limitations. Applications include compensating for missing data in unsequenced genomic regions, designing oligonucleotide primers for target genes in species that lack sequence information, and preparing customized queries for homology searches.

Results: We introduce the maxAlike algorithm, which reconstructs a genomic sequence for a specific taxon based on sequence homologs in other species. The input is a multiple sequence alignment and a phylogenetic tree that also contains the target species. For this target species, the algorithm computes nucleotide probabilities at each sequence position. Consensus sequences are then reconstructed at a chosen confidence level. For 37 out of 44 target species in a test dataset, we obtain a significant increase in reconstruction accuracy compared with both the consensus sequence from the alignment and the sequence of the nearest phylogenetic neighbor. When considering only nucleotides above a confidence limit, maxAlike is significantly better (up to 10%) in all 44 species. The improved sequence reconstruction also increases the quality of PCR primer design for as yet unsequenced genes: the difference between the expected Tm and the real Tm of the primer-template duplex can be reduced by ~26% compared with other reconstruction approaches. We also show that the prediction accuracy is robust to common distortions of the input trees: it drops by only 1% on average across all species for 77% of trees derived from random genomic loci in a test dataset.

Availability: maxAlike is available for download and as a web server at: http://rth.dk/resources/maxAlike.

Contact: gorodkin@rth.dk

Supplementary information: Supplementary data are available at Bioinformatics online.
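
The reconstruction step described in the Results (nucleotide probabilities for the target species at one alignment column, followed by a confidence cutoff) can be illustrated with a minimal sketch. It is not the maxAlike implementation: the abstract does not state the substitution model or the calling rule, so the Jukes-Cantor model, the uniform base frequencies, the use of `N` for low-confidence positions, and all function names here are assumptions made for illustration. The tree is written as if rooted at the unobserved target leaf, which is valid for a time-reversible model.

```python
import math

NUCS = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of a -> b along a branch of length t (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial_likelihood(node, column):
    """Felsenstein pruning. A node is a leaf name (str) or a list of (child, branch_length)."""
    if isinstance(node, str):
        obs = column[node]
        if obs not in NUCS:  # gap or ambiguity code: treated as uninformative
            return {n: 1.0 for n in NUCS}
        return {n: 1.0 if n == obs else 0.0 for n in NUCS}
    like = {n: 1.0 for n in NUCS}
    for child, t in node:
        child_like = partial_likelihood(child, column)
        for a in NUCS:
            like[a] *= sum(jc_prob(a, b, t) * child_like[b] for b in NUCS)
    return like

def target_posterior(subtree, branch_to_target, column):
    """Posterior nucleotide probabilities at the target leaf, given the other species."""
    down = partial_likelihood(subtree, column)
    unnorm = {
        a: 0.25 * sum(jc_prob(a, b, branch_to_target) * down[b] for b in NUCS)
        for a in NUCS
    }
    total = sum(unnorm.values())
    return {a: p / total for a, p in unnorm.items()}

def call_base(posterior, confidence=0.9):
    """Report the most probable base if it reaches the confidence level, else 'N'."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= confidence else "N"

# Hypothetical example: two observed species, both showing 'A' at this column.
rest_of_tree = [("human", 0.1), ("mouse", 0.3)]
column = {"human": "A", "mouse": "A"}
posterior = target_posterior(rest_of_tree, 0.05, column)
print(posterior, call_base(posterior))
```

Applying `target_posterior` to every column of the alignment and joining the called bases would give a reconstructed sequence in which uncertain positions stay masked; raising `confidence` trades coverage for accuracy.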
