Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion

Given a small set of seed entities (e.g., “USA”, “Russia”), corpus-based set expansion aims to induce, from a given corpus, an extensive set of entities that share the same semantic class (Country in this example). Set expansion benefits a wide range of downstream applications in knowledge discovery, such as web search, taxonomy construction, and query suggestion. Existing corpus-based set expansion algorithms typically bootstrap the given seeds by incorporating lexical patterns and distributional similarity. However, because no negative sets are provided explicitly, these methods suffer from semantic drift: the seed set is expanded freely, without guidance. We propose a new framework, Set-CoExpan, that automatically generates auxiliary sets to serve as negative sets closely related to the target set of the user’s interest. It then co-expands the target and auxiliary sets together, extracting discriminative features by comparing the target set with the auxiliary sets, so that the resulting sets are each cohesive and distinct from one another, thereby addressing the semantic drift issue. In this paper, we demonstrate that generating auxiliary sets guides the expansion of the target set away from the ambiguous areas along its border with the auxiliary sets, and we show that Set-CoExpan significantly outperforms strong baseline methods.
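The co-expansion idea can be illustrated with a toy sketch. It is not the Set-CoExpan implementation: the auxiliary set is supplied by hand rather than generated automatically, the two-dimensional vectors are made-up stand-ins for corpus-derived features, and the names `co_expand`, `mean_similarity`, and `aux_region` are hypothetical. It only shows how letting a candidate join the set it resembles most keeps auxiliary-like entities out of the target set.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(entity, members, vectors):
    """Average similarity of an entity to the current members of a set."""
    return sum(cosine(vectors[entity], vectors[m]) for m in members) / len(members)


def co_expand(target_seeds, auxiliary_seeds, vectors, rounds=3):
    """Grow the target set and the auxiliary sets side by side.

    In each round, every unassigned entity is scored against every set and
    becomes a candidate only for its best-matching set; each set then takes
    its single highest-scoring candidate. (Illustrative rule, assumed here.)
    """
    sets = {"target": list(target_seeds)}
    for name, seeds in auxiliary_seeds.items():
        sets[name] = list(seeds)

    for _ in range(rounds):
        assigned = {e for members in sets.values() for e in members}
        best_per_set = {}
        for entity in vectors:
            if entity in assigned:
                continue
            scores = {
                name: mean_similarity(entity, members, vectors)
                for name, members in sets.items()
            }
            winner = max(scores, key=scores.get)
            if winner not in best_per_set or scores[winner] > best_per_set[winner][0]:
                best_per_set[winner] = (scores[winner], entity)
        if not best_per_set:
            break  # nothing left to assign
        for name, (_score, entity) in best_per_set.items():
            sets[name].append(entity)
    return sets


# Made-up feature vectors: first axis ~ "country-like", second ~ "region-like".
vectors = {
    "USA": (1.0, 0.1), "Russia": (0.9, 0.2), "China": (0.95, 0.15),
    "Texas": (0.2, 1.0), "Ontario": (0.1, 0.9), "California": (0.15, 0.95),
}

result = co_expand(["USA", "Russia"], {"aux_region": ["Texas"]}, vectors)
print(result)
```

Without the auxiliary set, "California" and "Ontario" would eventually be appended to the target set simply because they are the only entities left; with it, they are absorbed by the competing set instead.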
