A Two-Step Feature Selection Method to Predict Cancerlectins by Multiview Features and Synthetic Minority Oversampling Technique

Cancerlectins have an inhibitory effect on the growth of cancer cells and are currently being employed as therapeutic agents. Accurate identification of cancerlectins should provide insight into the molecular mechanisms of cancer. In this study, we propose a new computational method based on the random forest (RF) algorithm to improve the identification of cancerlectins. Before feature selection, a hybrid feature space is constructed by combining four individual feature spaces: CTD (Composition, Transition, and Distribution), PseAAC (Pseudo Amino Acid Composition), PSSM (Position-Specific Scoring Matrix), and disorder. SMOTE (Synthetic Minority Oversampling Technique) is applied to address the class imbalance in the data. To reduce feature redundancy and computational complexity, we propose a two-step feature selection process that retains only informative features. Five-fold cross-validation is used to evaluate the various prediction strategies. The proposed method achieves a sensitivity of 0.779, a specificity of 0.717, an accuracy of 0.748, and an MCC (Matthews correlation coefficient) of 0.497. The prediction results are also compared with those of existing methods on the same dataset using 5-fold cross-validation. The comparison demonstrates the effectiveness of our method for predicting cancerlectins.
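
As an illustration of two ingredients named in the abstract, the following minimal Python sketch shows (i) how sensitivity, specificity, accuracy, and MCC follow from confusion-matrix counts and (ii) the interpolation idea behind SMOTE. It is not the authors' implementation. The function names, the neighbour count `k`, and the confusion counts (1000 proteins per class, chosen only to match the reported sensitivity and specificity) are assumptions made for illustration.

```python
import math
import random


def confusion_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (SMOTE-style sketch).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Rank the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical counts: 1000 positives and 1000 negatives (an assumption),
# chosen so that sensitivity = 0.779 and specificity = 0.717.
sens, spec, acc, mcc = confusion_metrics(tp=779, tn=717, fp=283, fn=221)
print(round(sens, 3), round(spec, 3), round(acc, 3), round(mcc, 3))

# Toy two-dimensional minority class, for illustration only.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_like(toy_minority, n_new=2))
```

Under the equal-class-size assumption, these counts give an accuracy and MCC that round to 0.748 and 0.497, matching the values reported above. This is what one would expect if the evaluation set is balanced, although the abstract does not state the class sizes.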
