A Framework for Semisupervised Feature Generation and Its Applications in Biomedical Literature Mining

Feature representation is essential to machine learning and text mining. In this paper, we present a feature coupling generalization (FCG) framework for generating new features from unlabeled data. It selects two special types of features, i.e., example-distinguishing features (EDFs) and class-distinguishing features (CDFs) from original feature set, and then generalizes EDFs into higher-level features based on their coupling degrees with CDFs in unlabeled data. The advantage is: EDFs with extreme sparsity in labeled data can be enriched by their co-occurrences with CDFs in unlabeled data so that the performance of these low-frequency features can be greatly boosted and new information from unlabeled can be incorporated. We apply this approach to three tasks in biomedical literature mining: gene named entity recognition (NER), protein-protein interaction extraction (PPIE), and text classification (TC) for gene ontology (GO) annotation. New features are generated from over 20 GB unlabeled PubMed abstracts. The experimental results on BioCreative 2, AIMED corpus, and TREC 2005 Genomics Track show that 1) FCG can utilize well the sparse features ignored by supervised learning. 2) It improves the performance of supervised baselines by 7.8 percent, 5.0 percent, and 5.8 percent, respectively, in the tree tasks. 3) Our methods achieve 89.1, 64.5 F-score, and 60.1 normalized utility on the three benchmark data sets.

[1]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[2]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[3]  Graciela Gonzalez,et al.  BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition , 2007, Pacific Symposium on Biocomputing.

[4]  Malvina Nissim,et al.  Exploring the boundaries: gene and protein identification in biomedical text , 2005, BMC Bioinformatics.

[5]  Jun'ichi Tsujii,et al.  Evaluating contributions of natural language parsers to protein–protein interaction extraction , 2008, Bioinform..

[6]  Rie Kubota Ando,et al.  BioCreative II Gene Mention Tagging System at IBM Watson , 2007 .

[7]  Chun-Nan Hsu,et al.  Integrating high dimensional bi-directional parsing models for gene mention tagging , 2008, ISMB.

[8]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[9]  Jun'ichi Tsujii,et al.  A Rich Feature Vector for Protein-Protein Interaction Extraction from Multiple Corpora , 2009, EMNLP.

[10]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[11]  Xiaohua Hu,et al.  Learning an enriched representation from unlabeled data for protein-protein interaction extraction , 2010, BMC Bioinformatics.

[12]  Koby Crammer,et al.  Penn/Umass/CHOP Biocreative II systems , 2007 .

[13]  William R. Hersh,et al.  TREC GENOMICS Track Overview , 2003, TREC.

[14]  Marti A. Hearst,et al.  TREC 2007 Genomics Track Overview , 2007, TREC.

[15]  Evgeniy Gabrilovich,et al.  Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis , 2007, IJCAI.

[16]  Jun'ichi Tsujii,et al.  Combining Multiple Layers of Syntactic Information for Protein-Protein Interaction Extraction , 2008 .

[17]  C. Oliveira Splitting the Unsupervised and Supervised Components of Semi-Supervised Learning , 2005 .

[18]  Rohit J. Kate,et al.  Comparative experiments on learning information extractors for proteins and their interactions , 2005, Artif. Intell. Medicine.

[19]  Kenneth Ward Church,et al.  Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[20]  Hongfei Lin,et al.  Incorporating rich background knowledge for gene named entity classification and recognition , 2009, BMC Bioinformatics.

[21]  Rajat Raina,et al.  Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.

[22]  WIM at TREC 2005 , 2005, TREC.

[23]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[24]  Andrew McCallum,et al.  Efficiently Inducing Features of Conditional Random Fields , 2002, UAI.

[25]  Sougata Mukherjea,et al.  Biomedical Document Triage: Automatic Classification Exploiting Category Specific Knowledge , 2005, TREC.

[26]  Jari Björne,et al.  Comparative analysis of five protein-protein interaction corpora , 2008, BMC Bioinformatics.

[27]  Lorraine K. Tanabe,et al.  Generation of a Large Gene/protein Lexicon by Morphological Pattern Analysis , 2004, J. Bioinform. Comput. Biol..

[28]  Hongfei Lin,et al.  TREC 2005 Genomics Track Experiments at DUTAI , 2005, TREC.

[29]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..