Active semi-supervised learning for biological data classification

As datasets have grown continuously, considerable effort has gone into addressing the large amount of unlabeled data relative to the scarcity of labeled data. A related issue is the trade-off between the difficulty of obtaining annotations from a specialist and the need for a significant amount of annotated data to obtain a robust classifier. In this context, combining active learning with semi-supervised learning is attractive: a small number of informative samples, selected by the active learning strategy and labeled by a specialist, can have their labels propagated to a set of unlabeled data by the semi-supervised strategy. However, most works in the literature neglect the interactive response times required by certain real applications. We propose a more effective and efficient active semi-supervised learning framework, including a new active learning method. We performed an extensive experimental evaluation in the biological context (using the ALL-AML, Escherichia coli and PlantLeaves II datasets), comparing our proposals with state-of-the-art works from the literature and with different supervised (SVM, RF, OPF) and semi-supervised (YATSI-SVM, YATSI-RF and YATSI-OPF) classifiers. The results show the benefits of our framework, which allows the classifier to reach higher accuracies more quickly with fewer annotated samples. Moreover, the selection criterion adopted by our active learning method, based on diversity and uncertainty, prioritizes the most informative boundary samples for the learning process. We obtained a gain of up to 20% over other learning techniques. The active semi-supervised learning approaches presented a better trade-off between accuracy and computational time, which remained competitive and viable, than the active supervised learning ones.
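The general idea of selecting uncertain yet diverse samples for annotation and then propagating labels to the remaining data can be illustrated with a minimal sketch. This is not the method proposed in the paper: the nearest-centroid scorer, the margin-based uncertainty, the farthest-first diversity rule, the self-training step and all function and parameter names (`centroid_proba`, `select_queries`, `active_semi_supervised`, `pool_factor`, `tau`) are assumptions chosen only to keep the example short and dependent on NumPy alone. The array `y_oracle` stands in for the specialist, and the seed set is assumed to contain at least one sample per class.

```python
import numpy as np


def centroid_proba(X_lab, y_lab, X, n_classes):
    """Soft class scores from a nearest-centroid model (a stand-in classifier)."""
    centroids = np.stack([X_lab[y_lab == c].mean(axis=0) for c in range(n_classes)])
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    scores = np.exp(-dist)
    return scores / scores.sum(axis=1, keepdims=True)


def select_queries(proba, X_unl, batch, pool_factor=5):
    """Pick `batch` samples that are uncertain (small margin) and mutually distant."""
    top = np.sort(proba, axis=1)
    margin = top[:, -1] - top[:, -2]                    # small margin = near a boundary
    cand = np.argsort(margin)[: batch * pool_factor]    # most uncertain candidates
    chosen = [cand[0]]
    while len(chosen) < min(batch, len(cand)):
        # Distance from every candidate to its closest already-chosen sample.
        d = np.linalg.norm(
            X_unl[cand][:, None, :] - X_unl[chosen][None, :, :], axis=2
        ).min(axis=1)
        d[np.isin(cand, chosen)] = -1.0                 # never re-pick a chosen sample
        chosen.append(cand[int(np.argmax(d))])          # farthest-first for diversity
    return np.array(chosen)


def active_semi_supervised(X, y_oracle, seed_idx, n_classes, rounds=3, batch=4, tau=0.6):
    """Alternate querying the oracle, then pseudo-label confident unlabeled samples."""
    labeled = list(seed_idx)
    for _ in range(rounds):
        unl = np.setdiff1d(np.arange(len(X)), labeled)
        if len(unl) == 0:
            break
        proba = centroid_proba(X[labeled], y_oracle[labeled], X[unl], n_classes)
        picked = select_queries(proba, X[unl], batch)
        labeled.extend(unl[picked].tolist())            # the "specialist" annotates these

    # Semi-supervised step (self-training): keep only confident pseudo-labels.
    unl = np.setdiff1d(np.arange(len(X)), labeled)
    proba = centroid_proba(X[labeled], y_oracle[labeled], X[unl], n_classes)
    confident = proba.max(axis=1) >= tau
    X_train = np.concatenate([X[labeled], X[unl][confident]])
    y_train = np.concatenate([y_oracle[labeled], proba.argmax(axis=1)[confident]])
    return labeled, X_train, y_train
```

Any classifier that yields class scores could replace the centroid model, and the enlarged training set `(X_train, y_train)` would then be used to fit the final classifier. Only the samples in `labeled` cost annotation effort.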
