Mining Polymorphic SSRs from Individual Genome Sequences

Simple sequence repeats (SSRs) are abundant in genome sequences and have become popular biomarkers for genetic studies. Several SSRs have proved essential for gene regulation, and abnormal repeat patterns in these critical SSRs can cause lethal diseases. Next-generation sequencing (NGS) technologies provide efficient approaches for detecting SSR polymorphism. However, previous approaches to identifying SSR markers could not avoid inefficient, manually curated steps. An automatic and efficient system is proposed for detecting polymorphic SSRs at genomic scale without manual curation or inspection. The workflow accepts multiple NGS datasets and starts with assembly, either de novo or by reference mapping. Consensus sequences are then obtained from the assembled contigs, and the coordinates of each individual contig are calibrated by alignment to the selected reference sequences. Next, the SSR mining step retrieves all potential polymorphic SSRs, that is, SSRs whose repeat patterns differ between sequences because of insertions or deletions. Sequence data from the 1000 Genomes Project trios were used as test datasets, and the CODIS SSR markers and 9 well-known disease-related SSR motifs served as verification targets. The results show that the proposed method can identify known polymorphic SSRs as well as novel SSR markers, provided the consensus sequences contain no sequencing or mapping errors. By using NGS technologies to identify SSR polymorphism, the proposed method can accelerate related research and facilitate the selection of novel SSR biomarkers and the discovery of regulatory elements.
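
The SSR mining step can be illustrated with a minimal Python sketch. The abstract does not give the implementation, so everything here is an assumption: the function names `find_ssrs` and `polymorphic_ssrs`, the thresholds (`max_unit`, `min_copies`, `flank`), and the sequences are hypothetical. The sketch also replaces the alignment-based coordinate calibration with a much simpler technique, anchoring each reference SSR by its exact flanking sequence, and it handles only perfect repeats in error-free consensus sequences.

```python
import re
from typing import List, NamedTuple, Tuple


class SSR(NamedTuple):
    start: int   # 0-based start position in the sequence
    motif: str   # repeat unit, e.g. "CAG"
    copies: int  # number of tandem copies


def find_ssrs(seq: str, max_unit: int = 6, min_copies: int = 3) -> List[SSR]:
    """Return perfect tandem repeats with unit length 1..max_unit (hypothetical helper)."""
    hits = []
    for k in range(1, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip units that are repeats of a shorter unit (e.g. "CAGCAG" = 2 x "CAG").
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append(SSR(m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


def polymorphic_ssrs(reference: str, individual: str,
                     flank: int = 6) -> List[Tuple[str, int, int, int]]:
    """Report (motif, reference start, reference copies, individual copies)
    for SSRs whose copy number differs between the two sequences."""
    results = []
    for ssr in find_ssrs(reference):
        end = ssr.start + len(ssr.motif) * ssr.copies
        left = reference[max(0, ssr.start - flank):ssr.start]
        right = reference[end:end + flank]
        if len(left) < flank or len(right) < flank:
            continue  # not enough flanking sequence to anchor the locus
        # Assumption: exact flank matching stands in for alignment to the reference.
        locus = re.compile("%s((?:%s)*)%s" % (
            re.escape(left), re.escape(ssr.motif), re.escape(right)))
        m = locus.search(individual)
        if m is None:
            continue  # flanks not found; the locus cannot be compared
        copies = len(m.group(1)) // len(ssr.motif)
        if copies != ssr.copies:
            results.append((ssr.motif, ssr.start, ssr.copies, copies))
    return results


# Illustrative sequences only; they are not taken from the 1000 Genomes data.
reference = "GATTACAGTC" + "CAG" * 5 + "TTGACCGTAA"
individual = "GATTACAGTC" + "CAG" * 7 + "TTGACCGTAA"
print(polymorphic_ssrs(reference, individual))
```

In this toy case the individual carries two extra copies of the CAG unit, which is the kind of insertion-driven difference the mining step is meant to report. A real pipeline would have to tolerate mismatches in the flanks and imperfect repeats, which this sketch does not.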
