Study of transductive learning and unsupervised feature construction methods for biological sequence classification

Next Generation Sequencing (NGS) technologies have led to fast and inexpensive production of large amounts of biological sequence data, including nucleotide sequences and derived protein sequences. These fast-increasing volumes of data pose challenges to computational methods for annotation. Machine learning approaches, primarily supervised algorithms, have been widely used to assist with classification tasks in bioinformatics. However, supervised algorithms rely on large amounts of labeled data to produce quality predictors, and labeled data is often difficult and expensive to acquire in sufficiently large quantities. When only limited amounts of labeled data but considerably larger amounts of unlabeled data are available for a specific annotation problem, semi-supervised learning approaches represent a cost-effective alternative. In this work, we focus on a special case of semi-supervised learning, namely transductive learning, in which the algorithm has access, during the training phase, to the instances that need to be labeled. Transduction is particularly suitable for biological sequence classification, where the goal is generally to label a given set of unlabeled instances. A challenge in this context is identifying compact sets of informative features: because labeled data is scarce, standard supervised feature selection methods may yield unreliable features. Therefore, we study recently proposed unsupervised feature construction approaches together with transductive learning. Experimental results on two classification problems, namely cassette exon identification and protein localization, show that the unsupervised features result in better performance than the supervised features.
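As a rough illustration of the transductive setting described above, the following minimal sketch builds features without using any labels and then spreads the few known labels to the unlabeled instances over a similarity graph. It rests on several assumptions that are not taken from this work: overlapping k-mer counts stand in for the unsupervised feature construction step, a simple clamped label propagation stands in for the transductive learner, and the toy sequences, the class names "pos" and "neg", and the function names are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=2):
    """Count overlapping k-mers; an assumed stand-in for unsupervised features."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def propagate_labels(sequences, labels, k=2, iterations=50):
    """Label the instances whose entry in `labels` is None.

    Labeled and unlabeled instances are both used to build the graph,
    which is what makes this setting transductive.
    """
    feats = [kmer_counts(s, k) for s in sequences]
    classes = sorted({y for y in labels if y is not None})
    n = len(sequences)
    # Similarity graph over all instances, with no self-loops.
    weights = [[cosine(feats[i], feats[j]) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    # One score per class; labeled instances start with a one-hot score.
    scores = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if labels[i] is not None or total == 0.0:
                updated.append(scores[i])  # clamp known labels, skip isolated nodes
                continue
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(len(classes))
            ])
        scores = updated
    return [
        labels[i] if labels[i] is not None
        else classes[max(range(len(classes)), key=lambda c: scores[i][c])]
        for i in range(n)
    ]


# Hypothetical toy data: two labeled sequences and four unlabeled ones.
sequences = ["ACACACAC", "GTGTGTGT", "ACACACAT", "GTGTGTGA", "CACACACA", "TGTGTGTG"]
labels = ["pos", "neg", None, None, None, None]
predicted = propagate_labels(sequences, labels)
print(predicted)
```

In this toy setup each unlabeled sequence takes the label of the labeled sequence it shares k-mers with. Real data would call for richer features and a carefully chosen graph or kernel.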
