Efficient similarity search in structured data

Modern database applications are characterized by two major aspects: the use of complex data types with internal structure and the need for new data analysis methods. The focus of database users has shifted from simple queries to complex analyses of the data, known as knowledge discovery in databases. Important tasks in this area are the grouping of data objects (clustering), the classification of new data objects or the detection of exceptional data objects (outlier detection). Most algorithms for solving those problems are based on similarity search in databases. This makes efficient similarity search in large databases of structured objects an important basic operation for modern database applications. In this thesis we develop efficient methods for similarity search in large databases of structured data and improve the efficiency of existing query processing techniques. For the data objects, only a tree or graph structure is assumed which can be extended with arbitrary attribute information. Starting with an analysis of the demands from two example applications, several important requirements for similarity measures are identified. One aspect is the adaptability of the similarity search method to the requirements of the user and the application domain. This can even imply a change of the similarity measure between two successive queries of the same user. An explanation component which makes clear why objects are considered similar by the system is a necessary precondition for a purposeful adaption of the measure. Consequently, the edit distance, well-known from string processing, is a common similarity measure for graph structured objects. Its feature to allow a visualization of corresponding substructures and the possibility to weight single operations are the reason for this popularity. But it turns out that the edit distance and similar measures for tree structures are computationally extremely complex which makes them unsuitable for today's large and even growing databases. Therefore, we develop a multi-step query processing architecture which reduces the number of necessary distance calculations significantly. This is achieved by employing suitable filter methods. Furthermore, we show that by easing certain restrictions on the similarity measure, a significant performance gain can be obtained without reducing the quality of the measure. To achieve this, matchings of substructures (vertices or edges) of the data objects are determined. An additional cost function for those matchings allows to derive a similarity measure for structured data, called the edge matching distance, from the cost optimal matching of the substructures. But even for this new similarity measure, efficiency can be improved significantly by using a multi-step query processing approach. This allows the use of the edge matching distance for knowledge discovery applications in large databases. Within the thesis, the properties of our new similarity search methods are proved both theoretically and through experiments.

[1]  Christos Faloutsos,et al.  Fast and Effective Retrieval of Medical Tumor Shapes , 1998, IEEE Trans. Knowl. Data Eng..

[2]  Peter N. Yianilos,et al.  Data structures and algorithms for nearest neighbor search in general metric spaces , 1993, SODA '93.

[3]  Tao Jiang,et al.  Alignment of Trees - An Alternative to Tree Edit , 1994, Theor. Comput. Sci..

[4]  Thomas Seidl,et al.  Adaptable Similarity Search in 3-D Spatial Database Systems (Abstract) , 1998, Datenbank Rundbr..

[5]  O. Gotoh An improved algorithm for matching biological sequences. , 1982, Journal of molecular biology.

[6]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.

[7]  Hans-Peter Kriegel,et al.  Similarity Search in Structured Data , 2003, DaWaK.

[8]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[9]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[10]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett..

[11]  Robert E. Tarjan,et al.  A V² Algorithm for Determining Isomorphism of Planar Graphs , 1971, Inf. Process. Lett..

[12]  Yannis Manolopoulos,et al.  Structure-based similarity search with graph histograms , 1999, Proceedings. Tenth International Workshop on Database and Expert Systems Applications. DEXA 99.

[13]  Hans-Peter Kriegel,et al.  Nearest Neighbor Classification in 3D Protein Databases , 1999, ISMB.

[14]  Nick Roussopoulos,et al.  Nearest neighbor queries , 1995, SIGMOD '95.

[15]  Philip N. Klein,et al.  Recognition of Shapes by Editing Shock Graphs , 2001, ICCV.

[16]  Erhard Rahm,et al.  Similarity flooding: a versatile graph matching algorithm and its application to schema matching , 2002, Proceedings 18th International Conference on Data Engineering.

[17]  Jacobo Torán,et al.  On the hardness of graph isomorphism , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[18]  Ignatios Vakalis,et al.  Using graph distance in object recognition , 1990, CSC '90.

[19]  Michael J. Fischer,et al.  The String-to-String Correction Problem , 1974, JACM.

[20]  Gabriel Valiente,et al.  Algorithms on Trees and Graphs , 2002, Springer Berlin Heidelberg.

[21]  Hans-Peter Kriegel,et al.  Multidimensional Order Preserving Linear Hashing with Partial Expansions , 1986, ICDT.

[22]  Gerhard Sagerer,et al.  Segmentation of molecular surfaces based on their convex hull , 1995, Proceedings., International Conference on Image Processing.

[23]  Christos Faloutsos,et al.  Efficient Similarity Search In Sequence Databases , 1993, FODO.

[24]  Sergey Brin,et al.  Near Neighbor Search in Large Metric Spaces , 1995, VLDB.

[25]  Norbert Krüger,et al.  Face recognition by elastic bunch graph matching , 1997, Proceedings of International Conference on Image Processing.

[26]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[27]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[28]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[29]  J. Munkres ALGORITHMS FOR THE ASSIGNMENT AND TRANSIORTATION tROBLEMS* , 1957 .

[30]  Christos Faloutsos,et al.  Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes , 2000, EDBT.

[31]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[32]  Horst Bunke,et al.  A graph distance metric based on the maximal common subgraph , 1998, Pattern Recognit. Lett..

[33]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[34]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[35]  Euripides G. M. Petrakis,et al.  Design and evaluation of spatial similarity approaches for image retrieval , 2002, Image Vis. Comput..

[36]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[37]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[38]  Tao Jiang,et al.  Some MAX SNP-Hard Results Concerning Unordered Labeled Trees , 1994, Inf. Process. Lett..

[39]  Christos Faloutsos,et al.  QBIC project: querying images by content, using color, texture, and shape , 1993, Electronic Imaging.

[40]  Z. Meral Özsoyoglu,et al.  Distance-based indexing for high-dimensional metric spaces , 1997, SIGMOD '97.

[41]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[42]  Hans-Peter Kriegel,et al.  Optimal multi-step k-nearest neighbor search , 1998, SIGMOD '98.

[43]  Padhraic Smyth,et al.  Knowledge Discovery and Data Mining: Towards a Unifying Framework , 1996, KDD.

[44]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[45]  G. Chartrand,et al.  Graph similarity and distance in graphs , 1998 .

[46]  Kaizhong Zhang,et al.  On the Editing Distance Between Undirected Acyclic Graphs , 1996, Int. J. Found. Comput. Sci..

[47]  Hans-Peter Kriegel,et al.  Efficient similarity search in large databases of tree structured objects , 2004, Proceedings. 20th International Conference on Data Engineering.

[48]  King-Sun Fu,et al.  A distance measure between attributed relational graphs for pattern recognition , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[49]  P Willett,et al.  Representation and searching of carbohydrate structures using graph-theoretic techniques. , 1997, Carbohydrate research.

[50]  Horst Bunke,et al.  Efficient Graph Matching for Video Indexing , 1997, GbRPR.

[51]  J. Köbler,et al.  The Graph Isomorphism Problem: Its Structural Complexity , 1993 .

[52]  Kaizhong Zhang,et al.  A constrained edit distance between unordered labeled trees , 1996, Algorithmica.

[53]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[54]  Kaizhong Zhang,et al.  On the Editing Distance Between Unordered Labeled Trees , 1992, Inf. Process. Lett..

[55]  R. Bayer,et al.  Organization and maintenance of large ordered indices , 1970, SIGFIDET '70.

[56]  Hans-Peter Kriegel,et al.  Web site mining: a new way to spot competitors, customers and suppliers in the world wide web , 2002, KDD.

[57]  David M. Mount,et al.  Isomorphism of graphs with bounded eigenvalue multiplicity , 1982, STOC '82.

[58]  Ricardo A. Baeza-Yates,et al.  Searching in metric spaces , 2001, CSUR.

[59]  Kaizhong Zhang,et al.  A System for Approximate Tree Matching , 1994, IEEE Trans. Knowl. Data Eng..

[60]  Eugene M. Luks,et al.  Isomorphism of graphs of bounded valence can be tested in polynomial time , 1980, 21st Annual Symposium on Foundations of Computer Science (sfcs 1980).

[61]  Jürg Nievergelt,et al.  The Grid File: An Adaptable, Symmetric Multikey File Structure , 1984, TODS.

[62]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.