Robust simultaneous positive data clustering and unsupervised feature selection using generalized inverted Dirichlet mixture models

The discovery, extraction and analysis of knowledge from data generally rely on unsupervised learning methods, in particular clustering approaches. Much recent research in clustering and data engineering has focused on finite mixture models, which make it possible to reason under uncertainty and to learn from examples. Adopting these models becomes challenging in the presence of outliers and with high-dimensional data, which calls for feature selection techniques. In this paper we simultaneously tackle the problems of cluster validation (i.e. model selection), feature selection and outlier rejection when clustering positive data. The proposed statistical framework is based on the generalized inverted Dirichlet distribution, which offers a more practical and flexible alternative to the inverted Dirichlet, whose covariance structure is very restrictive. The parameters of the resulting model are learned by minimizing a message length objective that incorporates prior knowledge. We use synthetic data and real data drawn from challenging applications, namely visual scene and object clustering, to demonstrate the feasibility and advantages of the proposed method.
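For concreteness, the two main ingredients can be sketched as follows; these are standard forms from the GID and MML literature, and the parameterization actually used in the paper may differ in notation. The generalized inverted Dirichlet (GID) density of a D-dimensional positive vector X = (X_1, ..., X_D) can be written as

p(\mathbf{X} \mid \boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{d=1}^{D} \frac{\Gamma(\alpha_d + \beta_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)} \, X_d^{\alpha_d - 1} \Bigl(1 + \sum_{k=1}^{d} X_k\Bigr)^{-\gamma_d}, \qquad \gamma_d = \beta_d + \alpha_d - \beta_{d+1}, \quad \beta_{D+1} = 0,

so each dimension carries its own shape pair (\alpha_d, \beta_d); this is what relaxes the single shared normalizing term, and hence the restrictive covariance structure, of the inverted Dirichlet. The message length objective minimized during learning is assumed here to follow the usual Wallace-Freeman (MML87) approximation

\mathrm{MessLen}(\Theta) \approx -\log h(\Theta) + \tfrac{1}{2} \log \lvert F(\Theta) \rvert - \log p(\mathcal{X} \mid \Theta) + \tfrac{N_p}{2} \bigl(1 + \log \tfrac{1}{12}\bigr),

where h(\Theta) is the prior encoding the available knowledge, F(\Theta) the Fisher information matrix, N_p the number of free parameters and p(\mathcal{X} \mid \Theta) the mixture likelihood; the number of components and the set of retained features are then chosen as the configuration with the shortest message length.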
