Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering

In this paper, we explore the power of randomized algorithms to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.
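
As an illustration of how a locality sensitive hash can stand in for an exact similarity computation, the sketch below uses random hyperplane signatures, one well-known LSH family for cosine similarity. The abstract does not say which scheme is used, so this choice is an assumption, and all function names (`make_hyperplanes`, `signature`, `estimated_cosine`) are hypothetical. The sketch covers only the signature step: each vector is reduced to a short bit string whose Hamming distance to another signature approximates the angle between the original vectors. It does not cover the fast search over signatures that would be needed to avoid comparing all pairs.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian vectors, one hyperplane normal per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on the non-negative side, else 0."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    ]


def hamming(sig_a, sig_b):
    """Number of positions where two signatures disagree."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimated_cosine(sig_a, sig_b):
    """Two vectors at angle theta disagree on a bit with probability theta / pi,
    so the fraction of differing bits gives an estimate of theta."""
    theta = math.pi * hamming(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


def exact_cosine(u, v):
    """Reference cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy feature vectors; real noun vectors would be far higher-dimensional.
    u = [1.0, 2.0, 3.0]
    v = [3.0, 2.0, 1.0]
    planes = make_hyperplanes(num_bits=2048, dim=3, seed=42)
    print("exact    :", exact_cosine(u, v))
    print("estimated:", estimated_cosine(signature(u, planes), signature(v, planes)))
```

More signature bits tighten the estimate at the cost of more storage per noun. The appeal of the approach is that comparing two fixed-length bit strings is much cheaper than comparing two long, sparse feature vectors.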
