Reducing Semantic Drift with Bagging and Distributional Similarity

Iterative bootstrapping algorithms are typically compared using a single set of hand-picked seeds. However, we demonstrate that performance varies greatly depending on these seeds, and that seeds favourable to one algorithm can perform very poorly with others, making such comparisons unreliable. We exploit this wide variation with bagging, sampling from automatically extracted seeds to reduce semantic drift. Because semantic drift still occurs in later iterations, we also propose an integrated distributional similarity filter to identify and censor potential semantic drifts, ensuring over 10% higher precision when extracting large semantic lexicons.
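
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy bootstrapper, the Jaccard overlap used as a stand-in for distributional similarity, and every name and parameter (`bootstrap`, `bagged_lexicon`, `similarity_filter`, `min_votes`, `threshold`) are assumptions made only for illustration. The sketch runs a simple bootstrapper on several randomly sampled seed sets, keeps terms that enough runs agree on, and then drops candidates whose contexts overlap too little with the seed contexts.

```python
import random
from collections import Counter


def bootstrap(seeds, term_contexts, iterations=3, per_iter=1):
    """Toy bootstrapper (hypothetical): each iteration adds the terms that
    share the most contexts with the current lexicon."""
    lexicon = set(seeds)
    for _ in range(iterations):
        # All contexts seen with any term already in the lexicon.
        known = set().union(*(term_contexts[t] for t in lexicon))
        scores = {t: len(ctx & known)
                  for t, ctx in term_contexts.items() if t not in lexicon}
        # Rank by score, breaking ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        lexicon.update(t for t in ranked[:per_iter] if scores[t] > 0)
    return lexicon - set(seeds)


def bagged_lexicon(seed_pool, term_contexts, n_bags=10, bag_size=2,
                   min_votes=6, rng=None):
    """Run the bootstrapper on sampled seed sets and keep terms that at
    least `min_votes` of the runs extracted."""
    rng = rng or random.Random(0)
    votes = Counter()
    for _ in range(n_bags):
        bag = rng.sample(seed_pool, bag_size)  # one sampled seed set
        votes.update(bootstrap(bag, term_contexts))
    return {t for t, v in votes.items() if v >= min_votes}


def similarity_filter(candidates, seeds, term_contexts, threshold=0.3):
    """Assumed drift filter: keep candidates whose context set has Jaccard
    overlap with the pooled seed contexts of at least `threshold`."""
    seed_ctx = set().union(*(term_contexts[s] for s in seeds))
    keep = set()
    for t in candidates:
        ctx = term_contexts[t]
        union = ctx | seed_ctx
        if union and len(ctx & seed_ctx) / len(union) >= threshold:
            keep.add(t)
    return keep


# Toy data: each term maps to the set of contexts it was observed in.
term_contexts = {
    "cat": {"pet X", "feed X"},
    "dog": {"pet X", "walk X"},
    "hamster": {"pet X", "feed X"},
    "car": {"drive X", "walk X"},
}
```

Voting across sampled seed sets means a term extracted only because of one unlucky seed choice is unlikely to survive, while the filter gives a second, independent check on candidates added in later iterations.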
