Phred-Phrap package to analyses tools: a pipeline to facilitate population genetics re-sequencing studies

Background
Targeted re-sequencing is one of the most powerful and widely used strategies for population genetics studies because it allows an unbiased screening for variation and is suitable for a wide variety of organisms. Studies that require re-sequencing data include evolutionary inferences, epidemiological studies designed to capture rare polymorphisms responsible for complex traits, and screenings for mutations in families and small populations with high incidences of specific genetic diseases. Despite the advent of next-generation sequencing technologies, Sanger sequencing is still the most popular approach in population genetics studies, both because automatic sequencers based on capillary electrophoresis are widely available and because it is still less prone to sequencing errors, which is critical in population genetics studies. Two popular software applications for re-sequencing studies are Phred-Phrap-Consed-Polyphred, which performs base calling, alignment, graphical editing and genotype calling, and DnaSP, which performs a set of population genetics analyses. These independent tools are the start and end points of basic analyses. Between them lies a set of basic but error-prone tasks that must be performed on re-sequencing data.

Results
To assist with these intermediate tasks, we developed a pipeline that facilitates the data handling typical of re-sequencing studies. Our pipeline: (1) consolidates different outputs produced by distinct Phred-Phrap-Consed contigs sharing a reference sequence; (2) checks for genotyping inconsistencies; (3) reformats genotyping data produced by Polyphred into a matrix of genotypes with individuals as rows and segregating sites as columns; (4) prepares input files for haplotype inferences using the popular software PHASE; and (5) handles PHASE output files, which contain only polymorphic sites, to reconstruct the inferred haplotypes with both polymorphic and monomorphic sites, as required by population genetics software for re-sequencing data such as DnaSP.

Conclusion
We tested the pipeline in re-sequencing studies of haploid and diploid data in humans, plants, animals and microorganisms. It substantially decreased the time required for sequencing analyses and made the process more controlled, eliminating several classes of error that may occur when handling datasets. The pipeline is also useful for investigators using other tools for sequencing and population genetics analyses.
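As a rough illustration of step (5), the sketch below places phased alleles back onto a reference sequence so that each haplotype contains both monomorphic and polymorphic sites. It is a minimal sketch under stated assumptions, not the pipeline's actual code: the function name `rebuild_haplotype`, the toy reference, the 1-based positions and the in-memory dictionary of haplotypes are all hypothetical, and it does not parse real PHASE output files.

```python
def rebuild_haplotype(reference, positions, alleles):
    """Return a full-length haplotype.

    reference: reference sequence as a string (hypothetical toy input).
    positions: 1-based coordinates of the polymorphic sites on the reference.
    alleles:   one phased allele per polymorphic site, in the same order.
    """
    if len(positions) != len(alleles):
        raise ValueError("one allele is required per polymorphic position")
    seq = list(reference)
    for pos, allele in zip(positions, alleles):
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the reference")
        # Monomorphic sites keep the reference base; only this site changes.
        seq[pos - 1] = allele
    return "".join(seq)


# Toy data (assumed for illustration): three segregating sites and
# the two inferred haplotypes of one diploid individual.
reference = "ACGTACGTAC"
positions = [2, 5, 9]
phased = {"ind1_a": "CAA", "ind1_b": "TGA"}

full = {name: rebuild_haplotype(reference, positions, hap)
        for name, hap in phased.items()}

for name, seq in full.items():
    print(f">{name}\n{seq}")
```

Writing the rebuilt sequences in FASTA-like form, as the final loop does, is one plausible way to pass them to downstream population genetics software. The file format the pipeline itself produces is not specified here.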
