Count Data Clustering Using Unsupervised Localized Feature Selection and Outliers Rejection

This paper presents an unsupervised statistical model for simultaneous clustering, feature selection, and outlier rejection for count data. The proposed model is based on a finite discrete mixture, to which a uniform component is added to ensure robustness to outliers and noise. The choice of a finite mixture model is justified by its flexibility, its solid grounding in statistical theory, and its competitive results. We derive a complete learning approach, based on maximum a posteriori (MAP) estimation, that requires no a priori knowledge of the number of outliers or the number of clusters, and we provide a rigorous expectation-maximization (EM) algorithm for it. We report experimental results from applying our model to the challenging problems of visual scene categorization and texture discrimination.
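The core idea of a discrete mixture augmented with a uniform outlier component can be illustrated with a minimal sketch. This is not the paper's model: it omits feature selection and the automatic choice of the number of clusters, and several choices are assumptions made only for illustration. The clusters are plain multinomials, the uniform component is taken to be uniform over all count vectors with the same total, the MAP step uses a symmetric Dirichlet pseudo-count `alpha`, and the function names (`fit_robust_mixture` and its helpers) are hypothetical.

```python
import math
import numpy as np

# Element-wise log-gamma, used for multinomial and binomial coefficients.
_lgamma = np.vectorize(math.lgamma)

def log_multinomial(X, theta):
    """Log-probability of each row of X under each multinomial in theta (K x V)."""
    n = X.sum(axis=1)
    log_coef = _lgamma(n + 1.0) - _lgamma(X + 1.0).sum(axis=1)
    return log_coef[:, None] + X @ np.log(theta).T

def log_uniform(X):
    """Log-probability under a uniform law over count vectors with the same total.

    Assumption: there are C(n + V - 1, V - 1) such vectors for total n.
    """
    n = X.sum(axis=1)
    V = X.shape[1]
    return -(_lgamma(n + V) - _lgamma(n + 1.0) - math.lgamma(V))

def fit_robust_mixture(X, K, n_iter=100, alpha=1.0, seed=0):
    """EM for K multinomial clusters plus one uniform outlier component.

    alpha is an assumed Dirichlet pseudo-count that smooths the M-step.
    Returns (weights, theta, resp); the last column of resp is the
    posterior probability that a sample belongs to the outlier component.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    N, V = X.shape
    theta = rng.dirichlet(np.ones(V), size=K)
    weights = np.full(K + 1, 1.0 / (K + 1))
    log_u = log_uniform(X)
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for stability.
        log_p = np.column_stack([log_multinomial(X, theta), log_u])
        log_p = log_p + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mixing weights (floored to keep the logarithm finite).
        weights = np.maximum(resp.sum(axis=0) / N, 1e-12)
        weights /= weights.sum()
        # M-step: smoothed multinomial parameters for the K clusters.
        counts = resp[:, :K].T @ X + alpha
        theta = counts / counts.sum(axis=1, keepdims=True)
    return weights, theta, resp

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.multinomial(40, [0.7, 0.1, 0.1, 0.05, 0.05], size=30)
    b = rng.multinomial(40, [0.05, 0.05, 0.1, 0.1, 0.7], size=30)
    weights, theta, resp = fit_robust_mixture(np.vstack([a, b]), K=2)
    print("mixing weights:", np.round(weights, 3))
    print("outlier posteriors (first 5):", np.round(resp[:5, -1], 3))
```

In this sketch, a sample would be flagged as an outlier when its posterior for the last component exceeds that of every cluster. The uniform component acts as a catch-all with a flat likelihood, so samples that no cluster explains well are absorbed there instead of distorting the cluster parameters.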
