Scaling Distributional Similarity to Large Corpora

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naive nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n^2) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
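To make the quadratic cost concrete, the following is a minimal sketch of the naive nearest-neighbour search, not the paper's implementation. It assumes sparse context vectors stored as dictionaries and uses cosine as the similarity measure; the abstract names neither, and the names `cosine`, `naive_neighbours` and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Sparse context vector: context feature -> weight (an assumed representation).
Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def naive_neighbours(
    vectors: Dict[str, Vector], k: int
) -> Dict[str, List[Tuple[str, float]]]:
    """Compare every word with every other word: O(n^2) comparisons."""
    result = {}
    for word, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != word
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[word] = scored[:k]
    return result


# Toy vocabulary with made-up weights, for illustration only.
toy = {
    "car": {"drive": 2.0, "road": 1.0},
    "truck": {"drive": 1.0, "road": 1.0},
    "apple": {"eat": 3.0},
}
neighbours = naive_neighbours(toy, k=1)
```

With n words in the vocabulary, the outer loop and the inner comprehension together perform n(n-1) vector comparisons, which is the cost that approximate methods such as those compared in the paper aim to avoid.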

[1] Anders Holst, et al. Random indexing of text samples for latent semantic analysis, 2000.

[2] Walter A. Burkhard, et al. Some approaches to best-match file searching, 1973, Commun. ACM.

[3] Magnus Sahlgren, et al. Automatic bilingual lexicon acquisition using random indexing of parallel corpora, 2005, Nat. Lang. Eng.

[4] Peter N. Yianilos, et al. Data structures and algorithms for nearest neighbor search in general metric spaces, 1993, SODA '93.

[5] John Beidler, et al. Data Structures and Algorithms, 1996, Wiley Encyclopedia of Computer Science and Engineering.

[6] Gregory Grefenstette, et al. Explorations in automatic thesaurus discovery, 1994.

[7] James R. Curran, et al. Improvements in Automatic Thesaurus Extraction, 2002, ACL 2002.

[8] Pentti Kanerva, et al. Sparse distributed memory and related models, 1993.

[9] Patrick Pantel, et al. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering, 2005, ACL.

[10] Piotr Indyk, et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[11] James R. Curran, et al. Augmenting Approximate Similarity Searching with Lexical Information, 2005, ALTA.

[12] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[13] David P. Williamson, et al. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, 1995, JACM.

[14] Jun Sakuma, et al. Fast approximate similarity search in extremely high-dimensional data sets, 2005, 21st International Conference on Data Engineering (ICDE'05).

[15] Patrick Pantel, et al. Discovering word senses from text, 2002, KDD.

[16] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[17] James Richard Curran, et al. From distributional to semantic similarity, 2004.

[18] Michael E. Houle, et al. Navigating massive data sets via local clustering, 2003, KDD '03.

[19] James R. Curran, et al. Approximate Searching for Distributional Similarity, 2005, ACL 2005.

[20] T. Landauer, et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.

[21] David J. Weir, et al. Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity, 2005, CL.

[22] Magnus Sahlgren, et al. From Words to Understanding, 2001.