BioSGAN: Protein-Phenotype Co-mention Classification Using Semi-Supervised Generative Adversarial Networks

Valuable information relating human proteins to their phenotypes remains hidden from biomedical scientists because of the rapid growth in biomedical publications. Previous computational methods for extracting this knowledge have mostly relied on rule-based linguistic patterns and supervised machine learning. In this work, we propose BioSGAN, a novel method based on semi-supervised generative adversarial networks, for the protein-phenotype co-mention classification task. We demonstrate the potential of combining a small labeled dataset with a large volume of unlabeled biomedical text, extracted from Medline abstracts and PubMed Central Open Access full-text articles, in a semi-supervised machine learning framework. Our method achieves state-of-the-art performance in classifying the validity of a given sentence-level co-mention of a human protein and a phenotype, outperforming a traditional machine learning-based counterpart. These findings have implications for biocurators, researchers, and the text mining community working on biomedical relation extraction.