A locality-aware similar information searching scheme

In a database, a similar information search finds the data records that contain the majority of the search keywords. With the rapid accumulation of information, databases have grown dramatically in size, so an efficient scheme is needed that both speeds up searching and retrieves all relevant records. This paper proposes a Hilbert curve-based similarity searching scheme (HCS). HCS treats a database as a multidimensional space and each data record as a point in that space. Using a Hilbert space-filling curve, each point is projected from the high-dimensional space to a low-dimensional space, so that points close to each other in the high-dimensional space are gathered together in the low-dimensional space. Because the database is thereby divided into many clusters of close points, a query is mapped to one cluster and answered from it, rather than by searching the entire database. Experimental results show that HCS dramatically reduces search latency and is highly effective in retrieving similar information.
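
The idea of mapping points to a Hilbert curve and answering a query from a single cluster can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes records have already been reduced to integer coordinates on a two-dimensional grid (the scheme itself targets higher-dimensional keyword spaces), it uses the standard iterative cell-to-index conversion for a 2-D Hilbert curve, and the names `hilbert_index`, `build_clusters`, `query`, and the `cluster_size` parameter are hypothetical.

```python
from collections import defaultdict


def hilbert_index(order, x, y):
    """Position of grid cell (x, y) along a Hilbert curve that covers
    a 2**order by 2**order grid (standard iterative conversion)."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def build_clusters(points, order, cluster_size):
    """Group points into clusters of consecutive Hilbert indices.
    cluster_size is an illustrative assumption, not a value from the paper."""
    clusters = defaultdict(list)
    for p in points:
        clusters[hilbert_index(order, *p) // cluster_size].append(p)
    return clusters


def query(clusters, order, cluster_size, q):
    """Return only the records in the cluster that the query point maps to."""
    return clusters.get(hilbert_index(order, *q) // cluster_size, [])


if __name__ == "__main__":
    records = [(0, 0), (1, 1), (0, 1), (3, 3)]
    clusters = build_clusters(records, order=2, cluster_size=4)
    print(query(clusters, 2, 4, (1, 0)))
```

Bucketing by contiguous index ranges is only one simple way to form clusters. Points that are close in space but fall on opposite sides of a bucket boundary end up in different clusters, so a practical scheme would likely also need to inspect neighbouring clusters.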
