Representative Selection in Nonmetric Datasets

This study considers the problem of representative selection: choosing a subset of data points that best represents the dataset as a whole. The subset should reflect the kinds of information contained in the entire set while minimizing redundancy. Clustering might seem like a natural approach to this problem. However, existing clustering methods are not ideally suited to representative selection, especially for nonmetric data, in which only a pairwise similarity measure exists. In this article we propose δ-medoids, a novel approach that can be viewed as an extension of the k-medoids algorithm and is specifically suited to representative selection from nonmetric data. We empirically validate δ-medoids in two domains: music analysis and motion analysis. We also present theoretical bounds on the performance of δ-medoids and on the hardness of representative selection in general.
