Predicting Influenza A Viral Host Using PSSM and Word Embeddings

The rapid mutation of influenza viruses threatens public health. Reassortment among viruses from different hosts can give rise to a fatal pandemic. Because influenza viruses circulate between species, the original host of a virus is difficult to determine during or after an outbreak, yet early and rapid identification of the viral host would help limit further spread. We use several machine learning models to infer the host of origin, with features derived from the position-specific scoring matrix (PSSM) and features learned through word embedding and word encoding. The PSSM-based model achieves an MCC of about 95% and an F1 score of about 96%; the word-embedding model achieves an MCC of about 96% and an F1 score of about 97%.
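
As an illustration of the word-encoding step mentioned above, the following minimal sketch splits a protein sequence into overlapping k-mers ("words") and maps each word to an integer index that an embedding layer could consume. The word size k = 3, the use of index 0 for unseen words, and the function names `kmer_words`, `build_vocab`, and `encode` are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List


def kmer_words(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers ("words").

    k = 3 is an assumed word size, not a value stated in the abstract.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences: List[str], k: int = 3) -> Dict[str, int]:
    """Assign each distinct k-mer an integer index, starting at 1.

    Index 0 is reserved (by assumption) for words not seen in training.
    """
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for word in kmer_words(seq, k):
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sequence: str, vocab: Dict[str, int], k: int = 3) -> List[int]:
    """Encode a sequence as a list of word indices; unseen words map to 0."""
    return [vocab.get(word, 0) for word in kmer_words(sequence, k)]
```

The resulting integer lists would typically be padded to a common length before being passed to an embedding layer or a classifier; that step is omitted here.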
