Paper Information - Scaling Distributional Similarity to Large Corpora

Scaling Distributional Similarity to Large Corpora

Accurately representing synonymy using distributional similarity requires large volumes of data to reliably represent infrequent words. However, the naive nearest-neighbour approach to comparing context vectors extracted from large corpora scales poorly (O(n²) in the vocabulary size). In this paper, we compare several existing approaches to approximating the nearest-neighbour search for distributional similarity. We investigate the trade-off between efficiency and accuracy, and find that SASH (Houle and Sakuma, 2005) provides the best balance.
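The quadratic cost mentioned in the abstract can be illustrated with a minimal sketch of the naive search: every word's context vector is compared against every other word's vector. The abstract does not specify a similarity measure, so cosine similarity is an assumption here, and the names `cosine`, `naive_neighbours`, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Tuple

# A sparse context vector: context feature -> weight.
Vector = Dict[str, float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors (assumed measure)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def naive_neighbours(
    vectors: Dict[str, Vector], k: int = 1
) -> Dict[str, List[Tuple[str, float]]]:
    """Brute-force k-nearest-neighbour search.

    Each of the n words is scored against the other n - 1 words,
    so the number of comparisons grows as O(n^2).
    """
    result: Dict[str, List[Tuple[str, float]]] = {}
    for word, vec in vectors.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in vectors.items()
            if other != word
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[word] = scored[:k]
    return result


# Toy data, invented for this example only.
toy = {
    "car": {"drive": 3.0, "road": 2.0},
    "automobile": {"drive": 2.0, "road": 1.0},
    "banana": {"eat": 4.0, "peel": 1.0},
}
neighbours = naive_neighbours(toy, k=1)
```

With a vocabulary of n words, the outer loop and the inner comprehension together perform n(n - 1) similarity computations. Approximate methods such as those compared in the paper aim to avoid scoring most of these pairs.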

James R. Curran | James Gorman

[1] Anders Holst, et al. Random indexing of text samples for latent semantic analysis, 2000.

[2] Walter A. Burkhard, et al. Some approaches to best-match file searching, 1973, Commun. ACM.

[3] Magnus Sahlgren, et al. Automatic bilingual lexicon acquisition using random indexing of parallel corpora, 2005, Nat. Lang. Eng.

[4] Peter N. Yianilos, et al. Data structures and algorithms for nearest neighbor search in general metric spaces, 1993, SODA '93.

[5] John Beidler, et al. Data Structures and Algorithms, 1996, Wiley Encyclopedia of Computer Science and Engineering.

[6] Gregory Grefenstette, et al. Explorations in automatic thesaurus discovery, 1994.

[7] James R. Curran, et al. Improvements in Automatic Thesaurus Extraction, 2002, ACL 2002.

[8] Pentti Kanerva, et al. Sparse distributed memory and related models, 1993.

[9] Patrick Pantel, et al. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering, 2005, ACL.

[10] Piotr Indyk, et al. Approximate nearest neighbors: towards removing the curse of dimensionality, 1998, STOC '98.

[11] James R. Curran, et al. Augmenting Approximate Similarity Searching with Lexical Information, 2005, ALTA.

[12] Moses Charikar, et al. Similarity estimation techniques from rounding algorithms, 2002, STOC '02.

[13] David P. Williamson, et al. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, 1995, JACM.

[14] Jun Sakuma, et al. Fast approximate similarity search in extremely high-dimensional data sets, 2005, 21st International Conference on Data Engineering (ICDE'05).

[15] Patrick Pantel, et al. Discovering word senses from text, 2002, KDD.

[16] Andrei Z. Broder, et al. On the resemblance and containment of documents, 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[17] James Richard Curran, et al. From distributional to semantic similarity, 2004.

[18] Michael E. Houle, et al. Navigating massive data sets via local clustering, 2003, KDD '03.

[19] James R. Curran, et al. Approximate Searching for Distributional Similarity, 2005, ACL 2005.

[20] T. Landauer, et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge, 1997.

[21] David J. Weir, et al. Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity, 2005, CL.

[22] Magnus Sahlgren, et al. From Words to Understanding, 2001.