Averaging measurement strategies for identifying single nucleotide polymorphisms from redundant data sets

Studies of single nucleotide polymorphisms (SNPs) have been an active topic of research in the life sciences in recent years. Because SNPs are abundant, stable, and sometimes associated with specific diseases, they are widely used as biomarkers for many research purposes. Traditional methods for identifying SNPs are time-consuming and expensive, so discovering SNPs from expressed sequence tags (ESTs) has become an efficient alternative. Most EST databases, however, do not store quality or trace files together with the EST reads, so methods such as Phard, which require the corresponding sequence quality files, cannot be applied to them. Computational methods that obtain reliable SNPs without trace or quality information are therefore still essential. We have developed a pipeline framework, called PFSNP, that reveals reliable SNPs from EST data sets without the associated trace or quality files. PFSNP combines several strategies, including a modified neighborhood quality standard measurement and fuzzy logic, and it automatically adjusts the sliding window to fit the conditions of different data sets. We demonstrate PFSNP by identifying SNPs in two subgroups of Oryza sativa, using two different strategies, and in zebrafish. In our experiments, PFSNP obtained more reliable results than existing methods.
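The general idea of redundancy-based SNP calling without quality files can be illustrated with a minimal sketch. This is not the PFSNP algorithm: the function names, the thresholds (`min_minor`, `flank`, `max_flank_mismatch`), and the rule of using flanking agreement with the consensus as a stand-in for base quality are all assumptions chosen for illustration. Reads are assumed to be pre-aligned strings of equal length, with `-` marking gaps.

```python
from collections import Counter

def consensus(reads):
    """Most common base in each alignment column (gaps '-' are ignored)."""
    cons = []
    for col in zip(*reads):
        counts = Counter(b for b in col if b != "-")
        cons.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(cons)

def candidate_snps(reads, min_minor=2, flank=3, max_flank_mismatch=0):
    """Return (column, major_allele, minor_allele) for candidate SNPs.

    Hypothetical rule: a read supports its base at a column only if its
    flanking window agrees with the consensus (a proxy for local quality
    when no trace/quality files exist). A column is reported when at
    least two alleles each have min_minor supporting reads.
    """
    cons = consensus(reads)
    n = len(cons)
    hits = []
    for i in range(n):
        support = Counter()
        lo, hi = max(0, i - flank), min(n, i + flank + 1)
        for r in reads:
            if r[i] == "-":
                continue
            # Count disagreements with the consensus in the flanks only.
            mism = sum(1 for j in range(lo, hi) if j != i and r[j] != cons[j])
            if mism <= max_flank_mismatch:
                support[r[i]] += 1
        alleles = [b for b, c in support.most_common() if c >= min_minor]
        if len(alleles) >= 2:
            hits.append((i, alleles[0], alleles[1]))
    return hits

reads = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTGCGTAC",  # G at column 4, seen in two reads
    "ACGTGCGTAC",
    "ACCTACGTAC",  # lone C at column 2, treated as a likely error
]
print(candidate_snps(reads))  # [(4, 'A', 'G')]
```

In this toy alignment, the variant supported by two reads with clean flanks is reported, while the singleton mismatch is not. A fixed `flank` is used here for simplicity; the abstract states that PFSNP adjusts its window automatically, which this sketch does not attempt.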
