A high performance prediction of HPV genotypes by Chaos game representation and singular value decomposition

Background: Human Papillomavirus (HPV) genotyping is an important tool in the fight against cervical cancer: it supplies information for risk stratification in diagnosis and improves understanding of the relationship between HPV and carcinogenesis. This paper proposes two new feature extraction techniques, ChaosCentroid and ChaosFrequency, for predicting HPV genotypes associated with the cancer. A diverse set of 12 HPV genotypes was studied: types 6, 11, 16, 18, 31, 33, 35, 45, 52, 53, 58, and 66.

In the proposed techniques, a partitioned Chaos Game Representation (CGR) is used to represent HPV genomes. ChaosCentroid captures the structure of a sequence through the centroid of each sub-region, and relates the sub-regions to one another through the Euclidean distances among the centroids and the center of the CGR. ChaosFrequency extracts the statistical distribution of mono-, di-, or higher-order nucleotides along HPV genomes and forms a matrix of the frequency of dots in each sub-region. For performance evaluation, four types of classifiers were used, i.e. Multi-layer Perceptron, Radial Basis Function, K-Nearest Neighbor, and Fuzzy K-Nearest Neighbor, and the best results from each classifier were compared with the NCBI genotyping tool.

Results: The experimental results obtained with the four classifiers follow the same trend. ChaosCentroid gave considerably higher performance than ChaosFrequency when the input length was one, but moderately lower performance than ChaosFrequency when the input length was two. Both proposed techniques yielded almost or exactly the best performance when the input length was more than three. There was no significant difference between the proposed techniques and the comparative alignment method.

Conclusions: The proposed alignment-free and scale-independent method can successfully transform HPV genomes of 7,000 to 10,000 base pairs into features of 1 to 11 dimensions. This indicates that ChaosCentroid and ChaosFrequency can serve as effective feature extraction techniques for predicting HPV genotypes.
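
The CGR-based feature extraction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the corner assignment (A, C, G, T on the unit square), the partition into a 2^k by 2^k grid, and the function names `cgr_points`, `chaos_frequency`, and `chaos_centroid`. The centroid feature is also simplified: it keeps only the distance from each sub-region centroid to the center of the CGR, whereas the abstract also mentions distances among the centroids.

```python
import math

# Assumed corner layout (a common CGR convention, not confirmed by the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Map a nucleotide sequence to CGR points in the unit square."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cell_index(point, k):
    """Return (row, col) of the grid cell holding a point, for a 2^k x 2^k grid."""
    n = 2 ** k
    col = min(int(point[0] * n), n - 1)
    row = min(int(point[1] * n), n - 1)
    return row, col


def chaos_frequency(points, k):
    """Count the CGR points that fall in each sub-region (frequency matrix)."""
    n = 2 ** k
    counts = [[0] * n for _ in range(n)]
    for p in points:
        row, col = cell_index(p, k)
        counts[row][col] += 1
    return counts


def chaos_centroid(points, k):
    """Distance from each sub-region centroid to the CGR center (0.5, 0.5).

    Simplified, hypothetical variant: empty sub-regions get 0.0, and
    centroid-to-centroid distances are not included.
    """
    n = 2 ** k
    sums = {}
    for p in points:
        key = cell_index(p, k)
        sx, sy, m = sums.get(key, (0.0, 0.0, 0))
        sums[key] = (sx + p[0], sy + p[1], m + 1)
    features = []
    for row in range(n):
        for col in range(n):
            if (row, col) in sums:
                sx, sy, m = sums[(row, col)]
                features.append(math.hypot(sx / m - 0.5, sy / m - 0.5))
            else:
                features.append(0.0)
    return features


points = cgr_points("AACG")
freq = chaos_frequency(points, 1)
dist = chaos_centroid(points, 1)
```

With `k = 1` each sub-region corresponds to the last nucleotide read, so the frequency matrix holds mono-nucleotide counts; with `k = 2` it holds di-nucleotide counts, and so on. This matches the abstract's description of mono-, di-, or higher-order nucleotide distributions, under the assumed corner layout.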
