Approximate search engine optimization for directory service

In many practical e-commerce systems, the stored data are usually short strings, such as names, addresses, or other brief fields. Searching within short strings differs from searching within long ones. General-purpose search engines are designed to scan long strings (or articles) quickly and locate the positions that match the search conditions. Several well-known online search algorithms (such as "agrep" as used inside glimpse, "cgrep" as used inside compressed indices, and "NR-grep") have been proposed for searching without any index in sublinear time o(n). However, for short strings (small n), algorithms running in O(n) and o(n) time perform much the same in practice. Suitable indices are therefore necessary to optimize the performance of the search engine. At the same time, directory services are becoming increasingly important because they are optimized for searching data, and the data stored in directory servers are almost all short strings. An approximate search engine for a directory service must therefore take the properties of short strings into consideration. In our previous research, we designed an approximate search engine specifically for short strings, which uses filters to select candidate short strings and then verifies the candidates to obtain the answers. However, the performance of that search engine needs to be improved. In this paper, we propose a new architecture and algorithm to optimize search performance for directory services.
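The filter-then-verify idea mentioned for the previous engine can be illustrated with a minimal sketch. The abstract does not say which filters are used, so the choices here are assumptions made for illustration only: a length filter and a padded q-gram count filter select candidates, and a standard dynamic-programming edit distance verifies them. The names `qgrams`, `edit_distance`, and `approximate_search`, and the parameters `k` (allowed errors) and `q` (gram length), are hypothetical.

```python
from collections import Counter


def qgrams(s, q=2):
    """Multiset of q-grams of s, padded so that short strings still yield grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def approximate_search(query, entries, k=1, q=2):
    """Return the entries within edit distance k of query (assumed filters)."""
    query_grams = qgrams(query, q)
    hits = []
    for entry in entries:
        # Length filter: k edits change the length by at most k.
        if abs(len(entry) - len(query)) > k:
            continue
        # Count filter: each edit destroys at most q padded q-grams, so two
        # strings within distance k share at least this many of them.
        needed = max(len(query), len(entry)) + q - 1 - k * q
        shared = sum((query_grams & qgrams(entry, q)).values())
        if shared < needed:
            continue
        # Verification: only surviving candidates pay for the full comparison.
        if edit_distance(query, entry) <= k:
            hits.append(entry)
    return hits
```

For example, `approximate_search("smith", ["smith", "smyth", "smithe", "jones", "smit", "schmidt"], k=1)` returns `['smith', 'smyth', 'smithe', 'smit']`. This sketch recomputes the q-grams of every entry per query; an index over those q-grams, of the kind the abstract argues for, would avoid that scan.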
