Semi-supervised Learning of Alternatively Spliced Exons Using Co-training

Alternative splicing is a phenomenon that gives rise to multiple mRNA transcripts from a single gene. It is believed that a large number of genes undergoes alternative splicing. Predicting alternative splicing events is a problem of great interest, as it can help the understanding of transcript diversity. Supervised machine learning approaches can be used to predict alternative splicing events at genome level. However, supervised approaches require large amounts of labeled data to learn accurate classifiers. While large amounts of genomic data are produced by the new sequencing technologies, labeling these data can be costly and time consuming. Therefore, semi-supervised learning approaches that can make use of large amounts of unlabeled data, in addition to small amounts of labeled data are highly desirable. In this work, we study the usefulness of a semi-supervised learning approach, co-training, for classifying exons as alternatively spliced or constitutive. The co-training algorithm makes use of two views of the data to iteratively learn two classifiers that can inform each other, at each step, with their best predictions on the unlabeled data. We consider two sets of features for constructing views for the problem of predicting alternatively spliced exons: exonic splicing enhancers and intronic regulatory sequences. We use the Naive Bayes Multinomial algorithm as a base classifier in our study. Experimental results show that the usage of the unlabeled data can result in better classifiers as compared to those obtained from the small amount of labeled data alone.

[1]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[2]  Jason Weston,et al.  Semi-supervised Protein Classification Using Cluster Kernels , 2003, NIPS.

[3]  Jun Du,et al.  When Does Cotraining Work in Real Data? , 2011, IEEE Transactions on Knowledge and Data Engineering.

[4]  Jun Du,et al.  When does Co-training Work in Real Data? , 2009, PAKDD.

[5]  Rayid Ghani,et al.  Analyzing the effectiveness and applicability of co-training , 2000, CIKM '00.

[6]  Gunnar Rätsch,et al.  RASE: recognition of alternatively spliced exons in C.elegans , 2005, ISMB.

[7]  Zhi-Hua Zhou,et al.  Improve Computer-Aided Diagnosis With Machine Learning Techniques Using Undiagnosed Samples , 2007, IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans.

[8]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[9]  Jason Weston,et al.  Protein Ranking by Semi-Supervised Network Propagation , 2006, BMC Bioinformatics.

[10]  Qiang Yang,et al.  Semi-supervised protein subcellular localization , 2009, BMC Bioinformatics.

[11]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[12]  Hiram Clawson,et al.  Intronic Alternative Splicing Regulators Identified by Comparative Genomics in Nematodes , 2006, PLoS Comput. Biol..

[13]  Xiaojin Zhu,et al.  Seeing stars when there aren’t many stars: Graph-based semi-supervised learning for sentiment categorization , 2006 .

[14]  Yoram Singer,et al.  Unsupervised Models for Named Entity Classification , 1999, EMNLP.

[15]  Doina Caragea,et al.  Prediction of alternatively spliced exons using Support Vector Machines , 2010, Int. J. Data Min. Bioinform..

[16]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.