MISSEL: a method to identify a large number of small species-specific genomic subsequences and its application to viruses classification

Background: Continuous improvements in next generation sequencing technologies have led to ever-increasing collections of genomic sequences, which biologists have not been able to characterize easily and whose analysis requires huge computational effort. The classification of species has emerged as one of the main applications of DNA analysis and has been addressed with several approaches, e.g., multiple-alignment-, phylogenetic-tree-, statistical-, and character-based methods.

Results: We propose a supervised method based on a genetic algorithm to identify small genomic subsequences that discriminate among different species. The method identifies multiple subsequences of bounded length with the same information power in a given genomic region. The algorithm has been successfully evaluated through its integration into a rule-based classification framework and applied to three different biological data sets: Influenza, Polyoma, and Rhino virus sequences.

Conclusions: We discover a large number of small subsequences that can be used to identify each virus type with high accuracy and low computational time, and that also help to characterize different genomic regions. With their length bounded to 20, our method found 1164 characterizing subsequences for all the Influenza virus subtypes, 194 for all the Polyoma viruses, and 11 for Rhino viruses. The abundance of small separating subsequences extracted for each genomic region may be an important support for quick and robust virus identification. Finally, useful biological information can be derived from the relative location and abundance of such subsequences along the different regions.
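To make the notion of a species-specific subsequence concrete, the following is a minimal sketch, not the genetic algorithm described in the abstract. It assumes a simplified definition: a subsequence of fixed length k is specific to a class if it occurs in every sequence of that class and in no sequence of any other class. It finds such subsequences by exhaustive enumeration on invented toy data. The function names `kmers` and `specific_kmers`, the labels, and the sequences are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(labelled, k):
    """Map each label to the k-mers present in every sequence of that
    label and absent from every sequence of any other label.

    This is brute-force enumeration for illustration only; it is an
    assumption of this sketch, not the paper's search procedure.
    """
    by_label = defaultdict(list)
    for label, seq in labelled:
        by_label[label].append(kmers(seq, k))

    result = {}
    for label, sets in by_label.items():
        # k-mers shared by all sequences of this label
        shared = set.intersection(*sets)
        # k-mers seen in any sequence of any other label
        others = set().union(
            *(s for other, group in by_label.items()
              if other != label for s in group)
        )
        result[label] = shared - others
    return result


# Invented toy data: two labels, two short sequences each.
toy = [
    ("A", "ACGTACGGA"),
    ("A", "TTACGGAGC"),
    ("B", "ACGTTTGCA"),
    ("B", "GGTTTGCAT"),
]
found = specific_kmers(toy, 4)
for label in sorted(found):
    print(label, sorted(found[label]))
```

Exhaustive enumeration of this kind becomes expensive on real genomes and on a range of lengths up to the bound, which is one reason a heuristic search such as a genetic algorithm is attractive for the same task.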
