Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae

Background: Geminiviruses infect a broad range of cultivated and non-cultivated plants, causing significant economic losses worldwide. Studies of their species diversity, taxonomy, mechanisms of evolution, geographic distribution, and mechanisms of interaction with the host have increased greatly in recent years. Furthermore, the use of rolling circle amplification (RCA) and advanced metagenomics approaches has enabled the elucidation of viromes and the identification of many viral agents in a large number of plant species. As a result, naming and taxonomically classifying geminiviruses have become complex tasks. In addition, the gene responsible for viral replication (particularly in viruses belonging to the genus Mastrevirus) may be spliced, because the virus uses the transcriptional/splicing machinery of the host cell. However, current tools have limitations in identifying introns.

Results: This study proposes a new method, designated Fangorn Forest (F2), which uses machine learning to classify genera ab initio, i.e., using only the genomic sequence, and to predict and classify genes in the family Geminiviridae. Nine genera of the family Geminiviridae and their related satellite DNAs were selected. We built two training sets: one for genus classification, containing attributes extracted from complete geminivirus genomes, and one for gene classification, containing attributes extracted from ORFs taken from those same genomes. Three ML algorithms were applied to these datasets to build the predictive models: support vector machines trained with sequential minimal optimization, random forest (RF), and multilayer perceptron.
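As a rough illustration of how a gene training set could be derived from complete genomes, the sketch below scans both strands of a sequence for ORFs and computes a few attributes per ORF. It is not the F2 implementation: the abstract does not say which attributes are used, so ORF length, GC content, and strand are assumptions chosen for illustration, as are the function names `find_orfs` and `orf_attributes`. The sketch also treats the genome as linear, although geminivirus genomes are circular, and it does not handle spliced genes.

```python
from typing import Dict, List

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(genome: str, min_len: int = 30) -> List[Dict[str, object]]:
    """Find ATG...stop ORFs of at least min_len nucleotides on both strands.

    Hypothetical helper: the genome is treated as linear, so ORFs that
    span the origin of a circular genome are missed.
    """
    genome = genome.upper()
    orfs: List[Dict[str, object]] = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf_seq = seq[start:i + 3]
                    if len(orf_seq) >= min_len:
                        orfs.append({
                            "strand": strand,
                            "frame": frame,
                            "start": start,
                            "sequence": orf_seq,
                        })
                    start = None
    return orfs


def orf_attributes(orf: Dict[str, object]) -> Dict[str, object]:
    """Compute illustrative attributes (assumed, not the published ones)."""
    seq = str(orf["sequence"])
    gc_content = (seq.count("G") + seq.count("C")) / len(seq)
    return {"length": len(seq), "gc_content": gc_content, "strand": orf["strand"]}
```

Each attribute dictionary would become one row of the gene dataset, labelled with the known gene for training; the genus dataset would be built the same way from whole-genome attributes.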
RF demonstrated very high predictive power, achieving precision, recall, and area under the curve (AUC) of 0.966, 0.964, and 0.995, respectively, for genus classification. For gene classification, RF reached precision, recall, and AUC of 0.983, 0.983, and 0.998, respectively.

Conclusions: Fangorn Forest therefore proved to be an efficient method for classifying genera of the family Geminiviridae with high precision and for effective gene prediction and classification. The method is freely accessible at www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp.
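For readers unfamiliar with the reported metrics, the following sketch shows how precision, recall, and AUC can be computed for a multi-class classifier. It is an illustration only: the abstract does not state whether the published figures are weighted or macro averages, so the class-weighted average used here is an assumption, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} for every label seen."""
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores


def weighted_average(
    scores: Dict[str, Tuple[float, float]], y_true: Sequence[str]
) -> Tuple[float, float]:
    """Average precision and recall, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    precision = sum(scores[c][0] * n for c, n in support.items()) / total
    recall = sum(scores[c][1] * n for c, n in support.items()) / total
    return precision, recall


def binary_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. For several classes this would be
    applied one-vs-rest and then averaged.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy labels for illustration only; not data from the study.
y_true = ["Begomovirus"] * 3 + ["Mastrevirus"] * 2
y_pred = ["Begomovirus", "Begomovirus", "Mastrevirus", "Mastrevirus", "Mastrevirus"]
scores = per_class_precision_recall(y_true, y_pred)
avg_precision, avg_recall = weighted_average(scores, y_true)
```

An AUC near 1, such as the values reported for RF, means the model almost always scores a true member of a class above a non-member.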
