Knowledge discovery and sequence-based prediction of pandemic influenza using an integrated classification and association rule mining (CBA) algorithm

Pandemic influenza is a major concern worldwide. The availability of advanced technologies and of the nucleotide sequences of a large number of pandemic and non-pandemic influenza viruses in 2009 provides a great opportunity to investigate the underlying rules of pandemic induction with data mining tools. Here, for the first time, an integrated classification and association rule mining algorithm (CBA) was used to discover the rules underpinning the alteration of non-pandemic sequences to pandemic ones. We hypothesized that the extracted rules could lead to the development of an efficient expert system for the prediction of influenza pandemics. To this end, we used a large dataset containing 5373 HA (hemagglutinin) segments of 2009 H1N1 pandemic and non-pandemic influenza sequences. The analysis was carried out for both nucleotide and protein sequences. We found a number of new rules that potentially represent undiscovered antigenic sites in the influenza structure. At the nucleotide level, alteration of thymine (T) at position 260 was the key feature discriminating non-pandemic from pandemic sequences. At the protein level, rules including I233K and M334L were the differentiating features. CBA classified pandemic and non-pandemic sequences with high accuracy at both the nucleotide and protein levels. The identification of hotspots in influenza sequences is significant because they represent regions with low antibody reactivity. We argue that the virus breaks through the host immune response by mutating at these spots. Based on the discovered rules, we developed the software "Prediction of Pandemic Influenza" for discriminating pandemic from non-pandemic sequences. This study opens a new vista in the discovery of association rules between mutation points during the evolution of pandemic influenza.
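The idea of classifying sequences with class association rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes pre-aligned sequences of equal length, mines only single-position rules of the form "residue X at position P implies class C", and omits the multi-item rule generation and database-coverage pruning of the full CBA algorithm. The function names, thresholds, and toy sequences are all hypothetical and do not correspond to the positions reported above.

```python
from collections import Counter


def mine_rules(seqs, labels, min_support=0.2, min_confidence=0.8):
    """Mine single-item class association rules from aligned sequences.

    Each rule is (position, residue, label, support, confidence), with
    1-based positions. Simplification: real CBA also mines multi-item rules.
    """
    n = len(seqs)
    item_counts = Counter()   # how often (position, residue) occurs overall
    rule_counts = Counter()   # how often (position, residue) occurs with a label
    for seq, label in zip(seqs, labels):
        for pos, residue in enumerate(seq, start=1):
            item_counts[(pos, residue)] += 1
            rule_counts[(pos, residue, label)] += 1

    rules = []
    for (pos, residue, label), hits in rule_counts.items():
        support = hits / n
        confidence = hits / item_counts[(pos, residue)]
        if support >= min_support and confidence >= min_confidence:
            rules.append((pos, residue, label, support, confidence))

    # Rank rules: higher confidence first, then higher support, then position.
    rules.sort(key=lambda r: (-r[4], -r[3], r[0], r[1]))
    return rules


def classify(seq, rules, default):
    """Return the label of the first ranked rule that matches the sequence."""
    for pos, residue, label, _support, _confidence in rules:
        if seq[pos - 1] == residue:
            return label
    return default


# Toy aligned sequences (hypothetical, for illustration only).
toy_seqs = ["MKAI", "MKAI", "MKTL", "MKTL", "MKTI"]
toy_labels = ["non-pandemic", "non-pandemic", "pandemic", "pandemic", "pandemic"]

toy_rules = mine_rules(toy_seqs, toy_labels)
for rule in toy_rules:
    print(rule)
print(classify("MKTI", toy_rules, default="non-pandemic"))
```

In this toy data, only positions 3 and 4 yield rules that pass the thresholds, since positions 1 and 2 are identical in both classes. An unseen sequence is labeled by the highest-ranked matching rule and falls back to the default class when no rule applies.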
