Combining Labelled and Unlabelled Data: A Case Study on Fisher Kernels and Transductive Inference for Biological Entity Recognition

We address the problem of using partially labelled data, e.g. large collections in which only a small fraction of the data is annotated, for extracting biological entities. Our approach combines probabilistic models, which we use to model the generation of entities and their context, with kernel machines, which implement powerful categorisers based on a similarity measure and some labelled data. This combination takes the form of so-called Fisher kernels, which implement a similarity measure based on an underlying probabilistic model. Such kernels are compared with transductive inference, an alternative approach to combining labelled and unlabelled data, again coupled with Support Vector Machines. Experiments are performed on a database of abstracts extracted from Medline.
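To illustrate how a Fisher kernel derives a similarity from a probabilistic model, here is a minimal sketch. It does not reproduce the model used in this work: it assumes a smoothed unigram (multinomial) model fitted on all available text, labelled or not, uses the square-root parametrisation rho_w = 2*sqrt(theta_w), ignores the sum-to-one constraint, and approximates the Fisher information matrix by the identity. The function names and the toy token lists are hypothetical.

```python
from collections import Counter
from math import sqrt, isclose


def fit_unigram(docs, smoothing=1.0):
    """Estimate smoothed unigram probabilities theta_w from token lists.

    No labels are needed, so unlabelled data can be used here.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values()) + smoothing * len(counts)
    return {w: (c + smoothing) / total for w, c in counts.items()}


def fisher_score(doc, theta):
    """Gradient of the per-token log-likelihood w.r.t. rho_w = 2*sqrt(theta_w).

    For log p(doc) = sum_w n_w * log(theta_w), the derivative w.r.t. rho_w
    is n_w / sqrt(theta_w) (sum-to-one constraint ignored, an assumption).
    Counts are divided by document length; unknown words are skipped.
    """
    counts = Counter(tok for tok in doc if tok in theta)
    length = max(len(doc), 1)
    return {w: (n / length) / sqrt(theta[w]) for w, n in counts.items()}


def fisher_kernel(doc_a, doc_b, theta):
    """Inner product of Fisher scores (Fisher information taken as identity)."""
    score_a = fisher_score(doc_a, theta)
    score_b = fisher_score(doc_b, theta)
    return sum(v * score_b[w] for w, v in score_a.items() if w in score_b)


# Toy data, made up for illustration only.
docs = [
    ["p53", "binds", "mdm2"],
    ["mdm2", "inhibits", "p53"],
    ["the", "cell", "cycle"],
]
theta = fit_unigram(docs)
k_related = fisher_kernel(docs[0], docs[1], theta)
k_unrelated = fisher_kernel(docs[0], docs[2], theta)
```

Under these assumptions, the kernel reduces to a dot product in which each shared word is weighted by the inverse of its model probability, so rare shared words contribute more to the similarity. A Gram matrix built from such a function could then be given to a kernel machine that accepts precomputed kernels.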
