Semi-supervised Learning for the BioNLP Gene Regulation Network

Background

The BioNLP Gene Regulation Task has attracted a diverse collection of submissions showcasing state-of-the-art systems. However, a principal challenge remains in obtaining a significant amount of recall. We argue that this is an important quality for Information Extraction tasks in this field. We propose a semi-supervised framework that leverages a large corpus of unannotated data available to us. In this framework, the annotated data is used to find plausible candidates for positive data points, which are included in the machine learning process. As this method is principally designed for gaining recall, we explore two additional methods to improve precision on top of it: weighted regularisation in the SVM framework, and filtering out unlabelled examples based on a probabilistic rule-finding method. The latter method also allows us to add candidates for negatives from unlabelled data, which is not viable in the unfiltered approach.

Results

We replicate one of the original participant systems and modify it to incorporate our methods. This allows us to test the extent of our proposed methods by applying them to the GRN task data. We find a considerable improvement in recall compared to the baseline system. We also investigate the evaluation metrics and find several mechanisms explaining a bias towards precision. Furthermore, these findings uncover an intricate precision-recall interaction, depriving recall of its habitual immediacy seen in traditional machine learning set-ups.

Conclusion

Our contributions are twofold:

1. An exploration of a novel semi-supervised pipeline. We have succeeded in employing additional knowledge through adding unannotated data points, while responding to the inherent noise of this method by imposing an automated, rule-based pre-selection step.
2. A thorough analysis of the evaluation procedure in the Gene Regulation Shared Task. We have performed an in-depth inquiry into the Slot Error Rate, responding to arguments that led to some design choices of this task. We have furthermore uncovered complexities in the interplay of precision and recall that negate the customary behaviour familiar to the machine learning engineer.
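
The bias towards precision mentioned above can be illustrated with a small sketch. It assumes the commonly used definition of the Slot Error Rate, SER = (S + D + I) / N, with S substitutions, D deletions, I insertions and N reference relations; the shared task's actual scoring may differ in its details. The function name, gene names and relation types below are hypothetical and serve only as an illustration.

```python
def slot_error_rate(reference, predicted):
    """Score predicted relations against reference relations.

    Both arguments map an (agent, target) pair to a relation type.
    Assumes SER = (S + D + I) / N, with N the number of reference relations.
    """
    if not reference:
        raise ValueError("reference must contain at least one relation")

    # Correct: same pair and same relation type.
    correct = sum(1 for pair, rel in predicted.items()
                  if reference.get(pair) == rel)
    # Substitution: pair is in the reference, but with another type.
    substitutions = sum(1 for pair, rel in predicted.items()
                        if pair in reference and reference[pair] != rel)
    # Insertion: predicted pair that is absent from the reference.
    insertions = sum(1 for pair in predicted if pair not in reference)
    # Deletion: reference pair that was never predicted.
    deletions = sum(1 for pair in reference if pair not in predicted)

    ser = (substitutions + deletions + insertions) / len(reference)
    recall = correct / len(reference)
    precision = correct / len(predicted) if predicted else 0.0
    return {"ser": ser, "precision": precision, "recall": recall}


# Hypothetical reference network with four relations.
reference = {
    ("geneA", "geneB"): "Activation",
    ("geneA", "geneC"): "Inhibition",
    ("geneB", "geneD"): "Activation",
    ("geneC", "geneD"): "Inhibition",
}

# A system that predicts nothing.
empty = slot_error_rate(reference, {})

# A cautious system: one prediction, which is correct.
cautious = slot_error_rate(reference, {("geneA", "geneB"): "Activation"})

# A recall-oriented system: two correct and two spurious predictions.
eager = slot_error_rate(reference, {
    ("geneA", "geneB"): "Activation",
    ("geneA", "geneC"): "Inhibition",
    ("geneX", "geneY"): "Activation",
    ("geneY", "geneZ"): "Inhibition",
})

print(empty, cautious, eager)
```

In this toy setting, the recall-oriented system finds twice as many correct relations as the cautious one, yet its two spurious predictions leave it with the same SER as a system that predicts nothing at all. Under this definition every false positive is counted as an error in its own right, which is one way a bias towards precision can arise.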
