Web-Scale Distributional Similarity and Entity Set Expansion

Computing the pairwise semantic similarity between all words on the Web is computationally challenging and requires both parallelization and optimization. We propose a highly scalable approach based on distributional similarity, implemented in the MapReduce framework and deployed over a 200 billion word crawl of the Web. The pairwise similarity between 500 million terms is computed in 50 hours using 200 quad-core nodes. We apply the learned similarity matrix to the task of automatic set expansion and present a large empirical study quantifying how corpus size, corpus quality, seed composition, and seed size affect expansion performance. We make public an experimental testbed for set expansion analysis that includes a large collection of diverse entity sets extracted from Wikipedia.
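
To make the idea concrete, the following is a minimal in-memory sketch of distributional similarity organized in a map/reduce style, followed by a simple seed-based set expansion. It is not the system described above: the function names (`context_vectors`, `pairwise_cosine`, `expand`), the one-token context window, the use of raw co-occurrence counts as feature weights, cosine as the similarity measure, and the sum-of-similarities ranking are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations
import math


def context_vectors(sentences, window=1):
    """Build a sparse context vector per term from raw co-occurrence counts.

    Assumption: features are neighbouring tokens within `window` positions.
    """
    vecs = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vecs[term][tokens[j]] += 1.0
    return vecs


def pairwise_cosine(vecs):
    """Cosine similarity for every pair of terms sharing at least one feature."""
    # "Map" step: invert term -> features into feature -> [(term, weight)].
    index = defaultdict(list)
    norms = {}
    for term, feats in vecs.items():
        norms[term] = math.sqrt(sum(w * w for w in feats.values()))
        for feat, w in feats.items():
            index[feat].append((term, w))

    # "Reduce" step: each feature contributes a partial dot product to every
    # pair of terms in its posting list; pairs are keyed as (a, b) with a < b.
    dots = defaultdict(float)
    for postings in index.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            dots[(a, b)] += wa * wb

    return {(a, b): d / (norms[a] * norms[b]) for (a, b), d in dots.items()}


def expand(seeds, sims, k=3):
    """Rank non-seed terms by their summed similarity to the seed set."""
    scores = defaultdict(float)
    for (a, b), s in sims.items():
        if a in seeds and b not in seeds:
            scores[b] += s
        elif b in seeds and a not in seeds:
            scores[a] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


if __name__ == "__main__":
    corpus = [
        ["i", "drink", "tea"],
        ["i", "drink", "coffee"],
        ["i", "drive", "cars"],
    ]
    sims = pairwise_cosine(context_vectors(corpus))
    print(expand({"tea"}, sims, k=2))
```

Only pairs of terms that share at least one feature are ever scored, which is what makes the inverted-index formulation attractive for sparse data; on a real cluster the two steps would run as distributed map and reduce jobs rather than as loops in a single process.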
