Semi-supervised Learning of Text Classification on Bacterial Protein-Protein Interaction Documents

Protein-protein interaction (PPI) networks are essential to understanding the fundamental processes governing cell biology, and the mining and curation of PPI knowledge is critical for analyzing high-throughput genomics and proteomics data. Several PPI knowledge bases have been built through expensive manual curation, but they are far from comprehensive. A document classification system that labels documents as PPI-related or not PPI-related would therefore assist the mining and curation of PPI knowledge. Building such a system requires an annotated corpus in which each document is tagged with a label (either positive or negative). In practice, however, only a small number of positive documents can be obtained, either manually or from existing PPI knowledge bases that cite literature evidence, while a large number of unlabeled documents are available, most of which are not PPI-related. Machine learning from a small number of positives and a large number of unlabeled documents is called learning from positive and unlabeled documents (LPU) and has been studied in the general domain. A popular approach to LPU is a two-step strategy: the first step obtains reliable negative documents (RN), and the second step refines RN using methods such as clustering or boosting. In this paper, we tackle the problem of LPU for PPI document classification and compare three two-step procedures on a public data set, Reuters-21578. The first procedure obtains a negative data set by building a machine learning classifier that treats every unlabeled document as negative and then classifies the unlabeled documents. The second procedure refines the negative data set iteratively and considers those unlabeled documents that are always classified as negative to be reliable negatives. The third procedure augments the negative data set iteratively by including unlabeled documents classified as negative in any iteration. Three machine learning algorithms were deployed for each two-step procedure.
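The three procedures described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the tiny naive Bayes classifier, all function names, and the toy documents are hypothetical, and the abstract does not say which three learning algorithms were used. The sketch also assumes that each iteration retrains on the positives versus the previous round's negatives and then re-classifies every unlabeled document; the abstract leaves that detail open.

```python
import math
from collections import Counter

def train_nb(pos_docs, neg_docs):
    """Fit a tiny multinomial naive Bayes model with Laplace smoothing.
    Assumes both document lists are non-empty lists of token lists."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    total = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        denom = sum(counts.values()) + len(vocab)
        model[label] = (math.log(len(docs) / total), counts, denom)
    return model

def is_negative(model, doc):
    """True when the negative class scores at least as high as the positive."""
    def score(label):
        prior, counts, denom = model[label]
        return prior + sum(math.log((counts[w] + 1) / denom) for w in doc)
    return score("neg") >= score("pos")

def negatives_one_pass(positives, unlabeled):
    """Procedure 1: treat all of U as negative, train, then classify U."""
    model = train_nb(positives, unlabeled)
    return {i for i, d in enumerate(unlabeled) if is_negative(model, d)}

def negatives_per_iteration(positives, unlabeled, max_iter=10):
    """Retrain on P vs. the current negatives and re-classify all of U
    each round (assumed reading of the iterative procedures)."""
    current = set(range(len(unlabeled)))
    history = []
    for _ in range(max_iter):
        negatives = [unlabeled[i] for i in sorted(current)]
        model = train_nb(positives, negatives)
        predicted = {i for i, d in enumerate(unlabeled) if is_negative(model, d)}
        history.append(predicted)
        if predicted == current or not predicted:
            break  # converged, or nothing left to train on
        current = predicted
    return history

def refined_negatives(history):
    """Procedure 2: documents classified as negative in every iteration."""
    return set.intersection(*history)

def augmented_negatives(history):
    """Procedure 3: documents classified as negative in any iteration."""
    return set.union(*history)

# Hypothetical toy corpus: P holds known positives, U holds unlabeled documents.
P = [["protein", "binds", "protein"], ["protein", "interaction", "complex"]]
U = [["protein", "binds", "complex"], ["stock", "market", "falls"],
     ["market", "price", "rises"], ["weather", "rain", "market"]]

one_pass = negatives_one_pass(P, U)
history = negatives_per_iteration(P, U)
refined = refined_negatives(history)
augmented = augmented_negatives(history)
print(one_pass, refined, augmented)
```

Because the first iteration is identical to the one-pass procedure, the refined set is always a subset of the one-pass set, which in turn is a subset of the augmented set. The second procedure therefore trades coverage for reliability, and the third does the opposite.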
