Multi-Label Bioinformatics Data Classification With Ensemble Embedded Feature Selection

In bioinformatics, a vast number of multi-label datasets, including clinical text, gene, and protein data, need to be categorized. Redundant or irrelevant features in such data limit the performance of multi-label classifiers, so selecting effective features from the feature space is necessary. However, most multi-label feature selection methods proposed in the past few years adopt a simple and direct strategy: they transform the multi-label feature selection problem into several single-label problems and ignore the correlations among labels. In this paper, a novel algorithm named ensemble embedded feature selection (EEFS) is proposed to handle the multi-label bioinformatics data learning problem more effectively and efficiently. EEFS not only explicitly identifies the correlations among labels but also exploits those correlations through multi-label classifiers and evaluation measures. Furthermore, it reduces the errors accumulated from the data itself by employing an ensemble method. Experimental results on five multi-label bioinformatics datasets show that EEFS significantly outperforms other state-of-the-art algorithms.
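
The abstract does not give the details of EEFS, so the following is only a minimal sketch of the general idea it describes, not the authors' algorithm. It assumes a toy setup: each ensemble member is trained on a bootstrap sample, each feature is scored with a multi-label evaluation measure (Hamming loss) computed from a classifier that predicts the whole label vector at once, and the members then vote on which features to keep. The 1-nearest-neighbour scorer, the function names, and all parameters (`n_members`, `per_member_k`, `n_select`) are illustrative assumptions.

```python
import random


def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots predicted incorrectly (a standard multi-label measure)."""
    slots = sum(len(row) for row in Y_true)
    wrong = sum(t != p for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp))
    return wrong / slots


def nearest_label_vector(X_train, Y_train, x, feature):
    """1-NN on a single feature. Returning the whole label vector keeps
    co-occurring labels together instead of treating each label separately."""
    best = min(range(len(X_train)), key=lambda i: abs(X_train[i][feature] - x[feature]))
    return Y_train[best]


def rank_features(X_train, Y_train, X_eval, Y_eval):
    """Rank features from lowest to highest Hamming loss (ties broken by index)."""
    losses = []
    for f in range(len(X_train[0])):
        Y_pred = [nearest_label_vector(X_train, Y_train, x, f) for x in X_eval]
        losses.append((hamming_loss(Y_eval, Y_pred), f))
    return [f for _, f in sorted(losses)]


def ensemble_select(X, Y, n_members=15, per_member_k=2, n_select=1, seed=0):
    """Hypothetical ensemble wrapper: bootstrap, rank, then vote."""
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * len(X[0])
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb = [X[i] for i in idx]
        Yb = [Y[i] for i in idx]
        # Assumption: evaluate on the full data for brevity; this is optimistic
        # because the evaluation rows overlap with the bootstrap sample.
        for f in rank_features(Xb, Yb, X, Y)[:per_member_k]:
            votes[f] += 1
    order = sorted(range(len(votes)), key=lambda f: (-votes[f], f))
    return order[:n_select]


# Toy data: feature 0 determines the label vector, features 1 and 2 are noise.
noise = random.Random(1)
X, Y = [], []
for i in range(12):
    positive = i % 2 == 0
    X.append([0.9 if positive else 0.1, noise.random(), noise.random()])
    Y.append([1, 1, 0] if positive else [0, 0, 1])

selected = ensemble_select(X, Y)
print(selected)
```

On this toy data the first two labels always occur together, so a scorer that predicts the full label vector rewards the one feature that explains both. A per-label transformation would have to rediscover that feature separately for each label. Voting across bootstrap members is one simple way to dampen the effect of noisy individual samples.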
