A Novel Hybridized Rough Set and Improved Harmony Search Based Feature Selection for Protein Sequence Classification

Progress in bioinformatics and biotechnology has generated a large amount of sequence data that requires detailed analysis. Recent advances in next-generation sequencing technologies have resulted in a tremendous rise in the rate at which protein sequence data are obtained. Big data analysis is a clear bottleneck in many applications, especially in bioinformatics, because of the complexity of the data to be analyzed. Protein sequence analysis is a significant problem in functional genomics. Proteins play an essential role in organisms, as they perform many important tasks in cells. In general, protein sequences are represented by feature vectors. A major problem with protein datasets is that their enormous number of features makes analysis complex. Feature selection techniques are capable of dealing with this high-dimensional feature space. In this chapter, a new feature selection algorithm that combines the Improved Harmony Search algorithm with Rough Set theory is proposed for protein sequences in order to tackle these big data problems. Improved Harmony Search (IHS) is a comparatively new population-based meta-heuristic optimization algorithm. It imitates the music improvisation process, in which each musician adjusts the pitch of their instrument in search of a perfect state of harmony, and it overcomes the limitations of the traditional Harmony Search (HS) algorithm. Improved Harmony Search is hybridized with Rough Set Quick Reduct for faster and better search capabilities. Feature vectors are extracted from a protein sequence database based on amino acid composition and K-mer patterns (K-tuples), and feature selection is then carried out on the extracted feature vectors. The proposed algorithm is compared with two prominent algorithms, Rough Set Quick Reduct and Rough Set based PSO Quick Reduct.
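As an illustration of the feature extraction step described above, the following minimal Python sketch builds a feature vector from amino acid composition and K-mer counts. It is not the chapter's implementation: the function names (`amino_acid_composition`, `kmer_counts`, `feature_vector`), the restriction to the 20 standard residues, and the default `k=2` are all assumptions made for the example.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are used as the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the relative frequency of each standard residue (20 values)."""
    seq = seq.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def kmer_counts(seq, k=2):
    """Return overlapping K-mer counts over the standard alphabet (20**k values).

    K-mers containing a non-standard symbol are simply not counted.
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]


def feature_vector(seq, k=2):
    """Concatenate composition and K-mer features (hypothetical layout)."""
    return amino_acid_composition(seq) + kmer_counts(seq, k)
```

With `k=2` each sequence yields 20 + 400 = 420 features, which shows how quickly the feature space grows with K and why a selection step is needed. Rough set methods operate on discrete attribute values, so continuous features such as these would normally be discretized first.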
The experiments are carried out on protein primary single-sequence data sets derived from PDB according to the SCOP classification, with structural class prediction over the classes all-α, all-β, α+β and α/β. The feature subsets selected by the existing and proposed algorithms are evaluated with decision tree classification algorithms.
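For readers unfamiliar with the Rough Set Quick Reduct baseline named in the abstract, the sketch below shows the standard greedy form of the idea: attributes are added one at a time so as to maximize the rough set dependency degree of the class label on the selected subset. This is a generic textbook-style sketch, not the chapter's hybrid IHS algorithm; the function names and the toy data are hypothetical, and attribute values are assumed to be already discretized.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices into equivalence classes induced by attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def dependency(rows, labels, attrs):
    """Dependency degree: fraction of rows in the positive region,
    i.e. in equivalence classes whose members all share one label."""
    if not rows:
        return 0.0
    pos = 0
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            pos += len(block)
    return pos / len(rows)


def quick_reduct(rows, labels):
    """Greedy forward selection until the subset matches the dependency
    of the full attribute set, or no single attribute improves it."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    reduct = []
    best = dependency(rows, labels, reduct)
    while best < target:
        cand, cand_gamma = None, best
        for a in all_attrs:
            if a in reduct:
                continue
            g = dependency(rows, labels, reduct + [a])
            if g > cand_gamma:
                cand, cand_gamma = a, g
        if cand is None:  # greedy search stalled; stop rather than loop
            break
        reduct.append(cand)
        best = cand_gamma
    return reduct, best


# Toy, made-up decision table: attribute 0 alone determines the label.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = ["a", "a", "b", "b"]
reduct, gamma = quick_reduct(rows, labels)
```

Because the greedy search can stall or settle on a non-minimal subset, meta-heuristics such as PSO or harmony search are commonly used to explore attribute subsets with the same dependency measure as the fitness signal.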
