Automatic term list generation for entity tagging

MOTIVATION Many entity taggers and information extraction systems rely on term lists for entity types such as people, places, genes, or chemicals. These lists have traditionally been constructed manually. We show that distributional clustering methods, which group words by the contexts in which they appear (including neighboring words and syntactic relations extracted with a shallow parser), can aid in constructing term lists.

RESULTS We learn term lists and use them as part of a gene tagger on a corpus of abstracts from the scientific literature. The automatically generated term lists significantly boost the precision of a state-of-the-art CRF-based gene tagger, to a degree competitive with hand-curated lists, and boost recall beyond what the hand-curated lists achieve. Our results also show that these distributional clustering methods do not generate lists as helpful as those generated by supervised techniques, but that they can complement supervised techniques to obtain better performance.

AVAILABILITY The code used in this paper is available from http://www.cis.upenn.edu/datamining/software_dist/autoterm/
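
The idea of grouping words by the contexts they appear in can be illustrated with a minimal sketch. This is not the clustering method or code used in the paper: it is a simplified stand-in that represents each word by counts of its immediate neighbors and proposes new list entries whose context vectors are close to those of a few seed terms. The function names, the toy sentences, the window size, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
import math


def context_vectors(sentences, window=1):
    """Map each word to counts of (offset, neighboring word) features."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(tokens):
                    vectors[word][(offset, tokens[j])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[feat] for feat, count in a.items() if feat in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_term_list(seeds, vectors, threshold=0.8):
    """Return non-seed words whose contexts resemble the pooled seed contexts."""
    centroid = Counter()
    for seed in seeds:
        centroid.update(vectors.get(seed, Counter()))
    return sorted(
        word
        for word, vec in vectors.items()
        if word not in seeds and cosine(vec, centroid) >= threshold
    )


# Toy corpus (hypothetical); real use would need a large unlabeled corpus.
sentences = [
    "the BRCA1 gene is expressed".split(),
    "the TP53 gene is expressed".split(),
    "the MYC gene is mutated".split(),
    "the patient is treated".split(),
]
vectors = context_vectors(sentences)
candidates = expand_term_list({"BRCA1", "TP53"}, vectors)
print(candidates)
```

On this toy corpus, "MYC" shares both of its neighbors ("the" before, "gene" after) with the seeds and is proposed, while "patient" shares only one and falls below the threshold. A fuller treatment would add syntactic-relation features and replace the seed-similarity step with a genuine clustering algorithm.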
