Fast, linear time, m-adic hierarchical clustering for search and retrieval using the Baire metric, with linkages to generalized ultrametrics, hashing, formal concept analysis, and precision of data measurement

We describe many vantage points on the Baire metric and its use in clustering data, or in preprocessing and structuring data to support search and retrieval operations. In some cases, we proceed directly to clusters without explicitly determining the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We also offer insights on the practical implications of the precision of data measurement. To treat multidimensional data, including very high dimensional data, we use random projections.
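To make the one-pass idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common base-10 form of the Baire distance, in which two values sharing their first k digits are at distance 10^-k, and it assumes each value is already given as a fixed-length digit string (the digits after the decimal point). The function names `common_prefix_length`, `baire_distance`, and `baire_hierarchy` are hypothetical. Reducing multidimensional data to one dimension by random projection is not shown.

```python
from collections import defaultdict


def common_prefix_length(a, b):
    """Number of leading digits that two digit strings share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def baire_distance(a, b, base=10):
    """Assumed Baire distance: base ** -(length of the longest common prefix)."""
    return float(base) ** -common_prefix_length(a, b)


def baire_hierarchy(digit_strings, depth):
    """Build prefix clusters for levels 1..depth in a single pass over the data.

    levels[k - 1] maps each k-digit prefix to the indices of the items that
    share it. No pairwise distance is computed. All strings are assumed to
    have at least `depth` digits.
    """
    levels = [defaultdict(list) for _ in range(depth)]
    for idx, s in enumerate(digit_strings):
        for k in range(1, depth + 1):
            levels[k - 1][s[:k]].append(idx)
    return [dict(level) for level in levels]


# Illustrative values only: digits after the decimal point of four measurements.
data = ["2571", "2534", "2901", "7120"]
hierarchy = baire_hierarchy(data, depth=2)
```

Each item is visited once and assigned to one bucket per level, so the cost is linear in the number of items for a fixed depth. The depth plays the role of measurement precision: keeping more digits yields finer clusters, and each cluster at level k is nested inside exactly one cluster at level k - 1.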
