Large Scale Knowledge Matching with Balanced Efficiency-Effectiveness Using LSH Forest

Evolving Knowledge Ecosystems were proposed to approach the Big Data challenge, following the hypothesis that knowledge evolves in a way similar to biological systems. The inner workings of a knowledge ecosystem can therefore be modeled on natural evolution. An evolving knowledge ecosystem consists of Knowledge Organisms, which form a representation of the knowledge, and the environment in which they reside. The environment consists of contexts, which are composed of so-called knowledge tokens. These tokens are ontological fragments extracted from information tokens, which in turn originate from the streams of information flowing into the ecosystem. In this article we investigate the use of LSH Forest (a self-tuning indexing scheme based on locality-sensitive hashing) for solving the problem of placing new knowledge tokens in the right contexts of the environment. We argue and show experimentally that LSH Forest possesses the required properties and could be used in large distributed set-ups. Further, we show experimentally that, for our type of data, minhashing works better than random hyperplane hashing. This paper is an extension of the paper “Balanced Large Scale Knowledge Matching Using LSH Forest” presented at the International Keystone Conference 2015.
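
The placement step described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that contexts and knowledge tokens are reduced to sets of strings, builds minhash labels from them, and looks up candidate contexts by longest shared label prefix, which is the core idea of LSH Forest. All names (`minhash_label`, `LSHForestSketch`, the example contexts) are hypothetical, a dictionary of prefixes stands in for real prefix trees, and the final ranking of candidates by actual similarity is left out.

```python
import hashlib
from collections import defaultdict


def minhash_label(tokens, seeds):
    """Return one min-hash value per seed for a non-empty set of strings.

    Two sets agree on a given position with probability equal to their
    Jaccard similarity (assuming the hash behaves like a random function).
    """
    def h(seed, token):
        digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    return tuple(min(h(seed, t) for t in tokens) for seed in seeds)


class LSHForestSketch:
    """Simplified LSH Forest: several 'trees', each keyed by label prefixes."""

    def __init__(self, num_trees=8, depth=6):
        self.depth = depth
        # Each tree uses its own, disjoint set of hash seeds.
        self.seed_sets = [
            [tree * depth + level for level in range(depth)]
            for tree in range(num_trees)
        ]
        # prefix (tuple of min-hash values) -> keys stored below that prefix
        self.trees = [defaultdict(set) for _ in range(num_trees)]

    def add(self, key, tokens):
        for seeds, tree in zip(self.seed_sets, self.trees):
            label = minhash_label(tokens, seeds)
            for d in range(self.depth + 1):
                tree[label[:d]].add(key)

    def query(self, tokens, m=1):
        """Collect candidates from the longest matching prefixes, shortening
        the prefix in all trees in step until at least m keys are found."""
        labels = [minhash_label(tokens, seeds) for seeds in self.seed_sets]
        candidates = set()
        for d in range(self.depth, -1, -1):
            for label, tree in zip(labels, self.trees):
                candidates |= tree.get(label[:d], set())
            if len(candidates) >= m:
                break
        return candidates


# Hypothetical contexts, each described by a set of terms.
forest = LSHForestSketch()
forest.add("hashing", {"lsh", "minhash", "hyperplane", "index"})
forest.add("biology", {"organism", "evolution", "environment", "fitness"})

# A new knowledge token is placed by querying with its own term set.
print(forest.query({"lsh", "minhash", "index", "forest"}, m=1))
```

Because the prefix length adapts to however many candidates are requested, no fixed number of hash functions per bucket has to be tuned in advance, which is the self-tuning property the abstract refers to.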
