An application of kernel methods to variety identification based on SSR markers genetic fingerprinting

Background: In crop production systems, genetic markers are increasingly used to distinguish individuals within a larger population based on their genetic make-up. Supervised approaches cannot be applied directly to genotyping data because these data are neither continuous, nor nominal, nor ordinal, but only partially ordered. A strategy is therefore needed to encode the polymorphism between samples so that known supervised approaches can be applied. Moreover, finding a minimal set of molecular markers with optimal ability to discriminate, for example, between given groups of varieties is important, as the genotyping process can be costly in terms of laboratory consumables, labor, and time. This feature selection problem also needs special care because of the specific nature of the data.

Results: An approach that encodes SSR polymorphisms in a positive definite kernel is presented, which then allows the use of any kernel-based supervised method. The polymorphism between samples is encoded through the Nei-Li genetic distance, which is shown to define a positive definite kernel between the genotyped samples. Additionally, a greedy feature selection algorithm for selecting SSR marker kits is presented to build economical and efficient prediction models for discrimination. The algorithm is a filter method and outperforms other filter methods adapted to this setting. When combined with kernel linear discriminant analysis, or with kernel principal component analysis followed by linear discriminant analysis, the approach leads to very satisfactory prediction models.

Conclusions: The main advantage of the approach is that it offers a flexible way to encode polymorphisms in a kernel. When combined with a feature selection algorithm that retains only a few specific markers, it leads to accurate and economical identification models based on SSR genotyping.
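The kernel construction described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how genotypes are encoded, so the sketch assumes each sample is a binary presence/absence vector over all observed SSR alleles, and it uses the standard Nei-Li (Dice) similarity 2·n_xy / (n_x + n_y), with the distance taken as one minus the similarity. The function name `nei_li_kernel` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def nei_li_kernel(X):
    """Nei-Li (Dice) similarity matrix for binary allele profiles.

    X: array of shape (n_samples, n_alleles); X[i, a] = 1 if allele a
    is observed in sample i, else 0 (an assumed encoding).
    """
    X = np.asarray(X, dtype=float)
    counts = X.sum(axis=1)            # n_x: number of alleles per sample
    if np.any(counts == 0):
        raise ValueError("every sample needs at least one observed allele")
    shared = X @ X.T                  # n_xy: alleles shared by each pair
    return 2.0 * shared / (counts[:, None] + counts[None, :])

# Toy data: 3 samples scored on 5 alleles (made up for illustration).
X = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
])

K = nei_li_kernel(X)                  # similarity, usable as a kernel matrix
D = 1.0 - K                           # Nei-Li genetic distance
eigenvalues = np.linalg.eigvalsh(K)   # numerical check of positive semidefiniteness
print(K)
print(eigenvalues)
```

A matrix built this way could be passed to any method that accepts a precomputed kernel, for example kernel PCA followed by linear discriminant analysis, which is one of the combinations the abstract mentions.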
