Fast Large-Scale Approximate Graph Construction for NLP

Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find approximate nearest neighbors quickly, it pairs a new distributed online-PMI algorithm with novel fast approximate nearest neighbor search algorithms (variants of PLEB). We show our system's efficiency in both intrinsic and extrinsic experiments, and we further evaluate our fast search algorithms both quantitatively and qualitatively on two NLP applications.
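
To make the idea of sketch-based approximate counts feeding a PMI score concrete, the following is a minimal illustration only. It assumes a generic Count-Min sketch, which may differ from the sketch FLAG actually uses, and it is neither distributed nor the paper's online-PMI algorithm. The names `CountMinSketch`, `build_sketch`, and `approx_pmi`, the key prefixes, and the default `width` and `depth` are all hypothetical choices made for this example.

```python
import hashlib
import math


class CountMinSketch:
    """Generic Count-Min sketch: fixed memory, counts are never underestimated."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # One deterministic hash function per row, obtained by salting with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum over rows is the tightest bound.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))


def build_sketch(pairs, width=2048, depth=4):
    """Stream (word, context) pairs once, storing joint and marginal counts in one sketch."""
    sketch = CountMinSketch(width, depth)
    total = 0
    for word, context in pairs:
        sketch.add(f"pair:{word}\t{context}")
        sketch.add(f"word:{word}")
        sketch.add(f"ctx:{context}")
        total += 1
    return sketch, total


def approx_pmi(sketch, word, context, total):
    """PMI = log2( c(w, c) * N / (c(w) * c(c)) ), computed from approximate counts."""
    joint = sketch.estimate(f"pair:{word}\t{context}")
    w_count = sketch.estimate(f"word:{word}")
    c_count = sketch.estimate(f"ctx:{context}")
    if joint == 0 or w_count == 0 or c_count == 0:
        return float("-inf")
    return math.log2(joint * total / (w_count * c_count))


if __name__ == "__main__":
    toy_pairs = [("cat", "purrs")] * 3 + [("dog", "barks")] * 3 + [("cat", "barks")]
    sketch, total = build_sketch(toy_pairs)
    print(approx_pmi(sketch, "cat", "purrs", total))
    print(approx_pmi(sketch, "cat", "barks", total))
```

In a pipeline of this kind, the PMI-weighted context vectors would then be handed to an approximate nearest neighbor search; that step is not shown here because the abstract does not specify how the PLEB variants work.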
