Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays.

Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP, because multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions (PRs) such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based prediction of polymorphic regions (mPPR), to predict PRs from resequencing array data. Conceptually similar to hidden Markov models, the method is trained with discriminative learning techniques related to support vector machines, and it accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. A nonredundant 27% of the genome was included within the boundaries of PRs predicted at high specificity (approximately 97%). The resulting data set provides a fine-scale view of polymorphic sequences in A. thaliana; patterns of polymorphism not apparent in SNP data were readily detected, especially for noncoding regions. Our predictions provide a valuable resource for evolutionary genetic and functional studies in A. thaliana, and our method is applicable to similar data sets in other species. More broadly, our computational approach can be applied to other segmentation tasks related to the analysis of genomic variation.
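
To make the segmentation idea concrete, the following is a minimal sketch of a two-state decoder that labels contiguous tracts of suppressed hybridization as PRs. It is not the mPPR method itself: the function name `segment_polymorphic_regions`, the normalized per-position `signal`, and the `threshold` and `switch_penalty` parameters are all assumptions made for illustration, and the parameters are set by hand here, whereas the abstract describes learning them with discriminative, margin-based training.

```python
from typing import List, Tuple


def segment_polymorphic_regions(
    signal: List[float],
    threshold: float = 0.5,
    switch_penalty: float = 0.6,
) -> List[Tuple[int, int]]:
    """Label each position as conserved (0) or polymorphic (1) with Viterbi
    decoding and return the polymorphic tracts as half-open (start, end)
    intervals. All parameters are hypothetical, hand-set values."""
    n = len(signal)
    if n == 0:
        return []

    # Per-position scores: (conserved, polymorphic).
    # Signal below the threshold favours the polymorphic state.
    emit = [(s - threshold, threshold - s) for s in signal]

    best = [emit[0][0], emit[0][1]]  # best path score ending in each state
    back = []                        # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_best = [0.0, 0.0]
        pointers = [0, 0]
        for state in (0, 1):
            stay = best[state]
            switch = best[1 - state] - switch_penalty  # pay to change state
            if stay >= switch:
                new_best[state], pointers[state] = stay + emit[i][state], state
            else:
                new_best[state], pointers[state] = switch + emit[i][state], 1 - state
        best = new_best
        back.append(pointers)

    # Trace the best path back from the last position.
    state = 0 if best[0] >= best[1] else 1
    labels = [state]
    for pointers in reversed(back):
        state = pointers[state]
        labels.append(state)
    labels.reverse()

    # Collect runs of state 1 as half-open intervals.
    regions = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, n))
    return regions


# Four consecutive low-signal positions are reported as one tract.
example = [0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
print(segment_polymorphic_regions(example))  # [(3, 7)]
```

With these hand-set values, a single isolated low-signal position is not reported, because the gain from labeling it polymorphic does not cover the cost of switching state twice. In a learned model, that trade-off would be determined from training data rather than chosen by hand.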
