Machine Learning Based Classification of Microsatellite Variation: An Effective Approach for Phylogeographic Characterization of Olive Populations

Finding efficient analytical techniques is increasingly becoming a bottleneck for making effective use of large biological datasets. Machine learning offers a powerful tool for advancing classification and modeling in molecular biology, yet these methods have rarely been applied to empirical population genetics data. In this study, we developed a combined data-analysis approach that applies machine learning algorithms to microsatellite marker data from our previous studies of olive populations. We investigated 267 olive accessions of various origins, comprising 21 reference cultivars, 132 local ecotypes, and 37 wild olive specimens from the Iranian plateau, together with 77 of the most widely represented Mediterranean varieties, using a carefully selected panel of 11 microsatellite markers. We organized the data into two experiments, ‘4-targeted’ and ‘16-targeted’. A strategy combining several machine-based analyses (i.e., data cleaning, feature selection, and machine learning classification) was devised to identify the most informative loci and the most diagnostic alleles for representing the population and geography of each olive accession. These analyses identified the microsatellite markers with the highest differentiating capacity and demonstrated that our method efficiently clusters olive accessions according to their regions of origin. A highlight of this study was the discovery of the best combination of markers for differentiating populations with machine learning models, an approach that can be exploited to distinguish among other biological populations.
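
As a rough illustration of the feature selection step, the sketch below ranks loci by information gain with respect to a region label. It is a minimal example under stated assumptions, not the study's actual pipeline: the abstract does not name the selection criterion, so information gain is chosen here only as one common option, and the locus names, genotypes, and region labels are hypothetical toy values.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(genotypes, labels):
    """Reduction in label entropy after grouping accessions by genotype at one locus."""
    groups = defaultdict(list)
    for genotype, label in zip(genotypes, labels):
        groups[genotype].append(label)
    # Weighted average entropy of the labels within each genotype group.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_loci(table, labels):
    """Return (locus, gain) pairs sorted from most to least informative."""
    gains = {locus: information_gain(calls, labels) for locus, calls in table.items()}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data (NOT from the study): genotypes written as allele
# sizes in base pairs at two made-up loci, for six accessions from two
# made-up regions.
labels = ["east", "east", "east", "west", "west", "west"]
table = {
    "locus_A": ["150/150", "150/150", "150/152", "160/160", "160/162", "160/160"],
    "locus_B": ["200/200", "200/204", "200/200", "200/200", "200/204", "200/200"],
}

ranking = rank_loci(table, labels)
for locus, gain in ranking:
    print(f"{locus}: {gain:.3f} bits")
```

In this toy table every genotype at `locus_A` occurs in only one region, so it ranks above `locus_B`, whose genotypes are spread evenly across both regions. Information gain tends to favor loci with many distinct genotypes, so a real analysis would likely pair it with cross-validated classification before settling on a marker panel.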
