New techniques for clustering complex objects

The vast amount of data produced today in application domains such as molecular biology and geography can be fully exploited only with efficient and effective data mining tools. One of the primary data mining tasks is clustering: partitioning the points of a data set into distinct groups (clusters) such that points within one cluster are similar to each other, whereas points from different clusters are not. Thanks to modern database technology, e.g. object-relational databases, huge numbers of complex objects from scientific, engineering, and multimedia applications are stored in database systems. Modelling such complex data often results in very high-dimensional vector data ("feature vectors"). In the context of clustering, this causes a number of fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because the data no longer forms clusters in such high-dimensional feature spaces. Usually, however, there are clusters embedded in lower-dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of the features is considered for clustering. This subset of features may even differ from cluster to cluster. In this thesis, we present original extensions and enhancements of the density-based clustering notion to cope with high-dimensional data. In particular, we propose an algorithm called SUBCLU (density-connected Subspace Clustering) that extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces.
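To make the target of subspace clustering concrete, the following is a minimal Python sketch of the brute-force baseline that the abstract describes: running DBSCAN separately in every non-empty subset of the features. It is not SUBCLU itself, which is stated to reach the same result efficiently. The function names (`region_query`, `dbscan`, `cluster_all_subspaces`), the Euclidean distance, the parameters `eps` and `min_pts`, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1


def region_query(points, idx, dims, eps):
    """Indices of all points within eps of points[idx], measured only on dims."""
    p = [points[idx][d] for d in dims]
    return [j for j, q in enumerate(points)
            if dist(p, [q[d] for d in dims]) <= eps]


def dbscan(points, dims, eps, min_pts):
    """Plain DBSCAN restricted to the feature subset dims.

    Returns one label per point: a cluster id (0, 1, ...) or NOISE.
    """
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, dims, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, dims, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                seeds.extend(j_neighbours)
        cluster += 1
    return labels


def cluster_all_subspaces(points, eps, min_pts):
    """Brute force: run DBSCAN in every non-empty subspace (2^d - 1 runs).

    Returns {dims: labels} for each subspace containing at least one cluster.
    """
    n_dims = len(points[0])
    result = {}
    for k in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), k):
            labels = dbscan(points, dims, eps, min_pts)
            if any(label != NOISE for label in labels):
                result[dims] = labels
    return result


if __name__ == "__main__":
    # Toy data: four points agree on feature 0 but are spread out on feature 1.
    points = [(1.0, 0.0), (1.1, 5.0), (0.9, 10.0), (1.0, 15.0),
              (5.0, 20.0), (9.0, 25.0)]
    print(cluster_all_subspaces(points, eps=0.5, min_pts=3))
    # prints {(0,): [0, 0, 0, 0, -1, -1]}
```

In this toy example the cluster is visible only in the one-dimensional subspace spanned by feature 0; in the full two-dimensional space every point is noise. The sketch also shows why the exhaustive approach does not scale: the number of DBSCAN runs grows exponentially with the number of features.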
