Feature selection and classification for gene expression data using novel correlation based overlapping score method via Chou’s 5-steps rule

Abstract The analysis of omics data together with knowledge-based interpretation can help obtaining important information regarding different biological processes and to reflect the current physiological status of tissue and cells. The main challenge, however, is to analyze high-dimensional gene expression data consisting of a massive amount of redundant genes in extracting disease-related information. To address this problem, gene selection, that eliminates redundant and irrelevant genes, has been a key step. In current article, a feature selection technique is proposed that exploit correlation based overlapping analysis of expression data across classes. The proposed correlation based overlapping score (COS) technique is compared with state-of-the-art gene selection approaches using real-world benchmark microarray datasets. In an experimental evaluation, the COS algorithm outperforms the other methods with minimum misclassification errors obtained via boosting, random forest and k-nearest neighbour (kNN) classifiers. Moreover, the proposed technique is more stable than the other techniques in gene selection.

[1]  Pufeng Du,et al.  PseAAC-General: Fast Building Various Modes of General Form of Chou’s Pseudo-Amino Acid Composition for Large-Scale Protein Datasets , 2014, International journal of molecular sciences.

[2]  Andrew Harrison,et al.  A feature selection method for classification within functional genomics experiments based on the proportional overlapping score , 2014, BMC Bioinformatics.

[3]  S. Ramaswamy,et al.  Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. , 2002, Cancer research.

[4]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[5]  K. Chou,et al.  Using LogitBoost classifier to predict protein structural classes. , 2006, Journal of theoretical biology.

[6]  K. Chou Structural bioinformatics and its impact to biomedical science. , 2004, Current medicinal chemistry.

[7]  J. Chou,et al.  Structure and mechanism of the M2 proton channel of influenza A virus , 2008, Nature.

[8]  Xin Wang,et al.  PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. , 2012, Analytical biochemistry.

[9]  Jan Komorowski,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm486 Data and text mining Monte Carlo , 2022 .

[10]  Wei Chen,et al.  iRNA-PseU: Identifying RNA pseudouridine sites , 2016, Molecular therapy. Nucleic acids.

[11]  Kuo-Chen Chou,et al.  pLoc_bal-mVirus: predict subcellular localization of multi-label virus proteins by PseAAC and IHTS treatment to balance training dataset. , 2018, Medicinal chemistry (Shariqah (United Arab Emirates)).

[12]  Kuo-Chen Chou,et al.  Predicting Functions of Proteins in Mouse Based on Weighted Protein-Protein Interaction Network and Protein Hybrid Properties , 2011, PloS one.

[13]  Constantin F. Aliferis,et al.  A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis , 2004, Bioinform..

[14]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[15]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[16]  Kuo-Chen Chou,et al.  pLoc_bal-mEuk: predict subcellular localization of eukaryotic proteins by general PseAAC and quasi-balancing training dataset. , 2019, Medicinal chemistry (Shariqah (United Arab Emirates)).

[17]  B. Liu,et al.  Identification of Real MicroRNA Precursors with a Pseudo Structure Status Composition Approach , 2015, PloS one.

[18]  K. Chou Progresses in Predicting Post-translational Modification , 2019, International Journal of Peptide Research and Therapeutics.

[19]  K. Chou Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology , 2009 .

[20]  Liangliang Kong,et al.  Architecture of the Mitochondrial Calcium Uniporter , 2016, Nature.

[21]  Shuo Li,et al.  MsDBP: Exploring DNA-binding Proteins by Integrating Multi-scale Sequence Information via Chou's 5-steps Rule. , 2019, Journal of proteome research.

[22]  Shi-Yun Wang,et al.  Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou's 5-steps rule and general pseudo components. , 2019, Genomics.

[23]  N. Ramaraj,et al.  A novel hybrid feature selection via Symmetrical Uncertainty ranking based local memetic search algorithm , 2010, Knowl. Based Syst..

[24]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[25]  K. Chou,et al.  pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key GO information into general PseAAC. , 2018, Genomics.

[26]  Prabina Kumar Meher,et al.  Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC , 2017, Scientific Reports.

[27]  Kuo-Chen Chou,et al.  pLoc_bal-mHum: Predict subcellular localization of human proteins by PseAAC and quasi-balancing training dataset. , 2019, Genomics.

[28]  K. Chou,et al.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. , 2014, Analytical biochemistry.

[29]  Kuo-Chen Chou,et al.  pLoc_bal-mPlant: Predict Subcellular Localization of Plant Proteins by General PseAAC and Balancing Training Dataset. , 2019, Current pharmaceutical design.

[30]  Ahmad Hassan Butt,et al.  Prediction of S-Sulfenylation Sites Using Statistical Moments Based Features via CHOU’S 5-Step Rule , 2019, International Journal of Peptide Research and Therapeutics.

[31]  Dong-Sheng Cao,et al.  propy: a tool to generate various modes of Chou's PseAAC , 2013, Bioinform..

[32]  Kuo-Chen Chou,et al.  pLoc‐mAnimal: predict subcellular localization of animal proteins with both single and multiple sites , 2017, Bioinform..

[33]  Kuo-Chen Chou,et al.  pLoc_bal‐mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC , 2018, Bioinform..

[34]  K. Chou Impacts of bioinformatics to medicinal chemistry. , 2015, Medicinal chemistry (Shariqah (United Arab Emirates)).

[35]  Hinrich W. H. Göhlmann,et al.  I/NI-calls for the exclusion of non-informative genes: a highly effective filtering tool for microarray data , 2007, Bioinform..

[36]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[37]  J. Chou,et al.  Unusual architecture of the p7 channel from hepatitis C virus , 2013, Nature.

[38]  Chris H. Q. Ding,et al.  Minimum redundancy feature selection from microarray gene expression data , 2003, Computational Systems Bioinformatics. CSB2003. Proceedings of the 2003 IEEE Bioinformatics Conference. CSB2003.

[39]  S. Harrison,et al.  Mitochondrial uncoupling protein 2 structure determined by NMR molecular fragment searching , 2011, Nature.

[40]  Kuo-Chen Chou,et al.  pLoc-mPlant: predict subcellular localization of multi-location plant proteins by incorporating the optimal GO information into general PseAAC. , 2017, Molecular bioSystems.

[41]  Ren Long,et al.  iRSpot-EL: identify recombination spots with an ensemble learning approach , 2017, Bioinform..

[42]  K. Chou,et al.  PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. , 2008, Analytical biochemistry.

[43]  K. Chou Impacts of pseudo amino acid components and 5-steps rule to proteomics and proteome analysis. , 2019, Current topics in medicinal chemistry.

[44]  K. Chou,et al.  An optimization approach to predicting protein structural class from amino acid composition , 1992, Protein science : a publication of the Protein Society.

[45]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[46]  Joanna Polanska,et al.  Adaptive filtering of microarray gene expression data based on Gaussian mixture decomposition , 2013, BMC Bioinformatics.

[47]  Todd,et al.  Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning , 2002, Nature Medicine.

[48]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.

[49]  Kuo-Chen Chou,et al.  A Two-Level Computation Model Based on Deep Learning Algorithm for Identification of piRNA and Their Functions via Chou’s 5-Steps Rule , 2019, International Journal of Peptide Research and Therapeutics.

[50]  Michael S. Seaman,et al.  Structural basis for membrane anchoring of HIV-1 envelope spike , 2016, Science.

[51]  E. Baralis,et al.  The Painter's Feature Selection for Gene Expression Data , 2007, 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.

[52]  Driss Aboutajdine,et al.  Application of Symmetric Uncertainty and Mutual Information to Dimensionality Reduction and Classification of Hyperspectral Images , 2012, 1211.0613.

[53]  Kuo-Chen Chou,et al.  pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by deep gene ontology learning via general PseAAC. , 2017, Genomics.

[54]  Kuo-Chen Chou,et al.  pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information , 2018, Bioinform..

[55]  Kuo-Chen Chou,et al.  An Unprecedented Revolution in Medicinal Chemistry Driven by the Progress of Biological Science. , 2017, Current topics in medicinal chemistry.

[56]  Kuo-Chen Chou,et al.  Prediction and classification of protein subcellular location—sequence‐order effect and pseudo amino acid composition , 2003, Journal of cellular biochemistry.

[57]  Ren Long,et al.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition , 2016, Bioinform..

[58]  Sher Afzal Khan,et al.  Bi-PSSM: Position specific scoring matrix based intelligent computational model for identification of mycobacterial membrane proteins. , 2017, Journal of theoretical biology.

[59]  K. Chou,et al.  Bioinformatical analysis of G-protein-coupled receptors. , 2002, Journal of proteome research.

[60]  K. Chou,et al.  Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. , 2015, Molecular bioSystems.

[61]  James G. Lyons,et al.  Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou׳s general PseAAC. , 2015, Journal of theoretical biology.

[62]  Fan Yang,et al.  iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC , 2018, Bioinform..

[63]  Kuo-Chen Chou,et al.  pLoc-mGpos: Incorporate Key Gene Ontology Information into General PseAAC for Predicting Subcellular Localization of Gram-Positive Bacterial Proteins , 2017 .

[64]  Bin Liu,et al.  Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences , 2017 .

[65]  Abbas A. Dehghani-Sanij,et al.  Computational model for the recognition of lower limb movement using wearable gyroscope sensor , 2019, Int. J. Sens. Networks.

[66]  Muhammad Kabir,et al.  iNR-2L: A two-level sequence-based predictor developed via Chou's 5-steps rule and general PseAAC for identifying nuclear receptors and their families. , 2019, Genomics.

[67]  Ludwig Lausser,et al.  Measuring and visualizing the stability of biomarker selection techniques , 2011, Computational Statistics.

[68]  K. Chou Advance in predicting subcellular localization of multi-label proteins and its implication for developing multi-target drugs. , 2019, Current medicinal chemistry.

[69]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[70]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..