A Model-Based Approach for Discrete Data Clustering and Feature Weighting Using MAP and Stochastic Complexity

In this paper, we consider the problem of unsupervised discrete feature selection/weighting. Discrete data are an important component in many data mining, machine learning, image processing, and computer vision applications, yet much of the published work on unsupervised feature selection has concentrated on continuous data. We propose a probabilistic approach that assigns relevance weights to discrete features, which are treated as random variables modeled by finite discrete mixtures. The choice of finite mixture models is justified by their flexibility, which has led to their widespread application in different domains. To learn the model, we consider both Bayesian and information-theoretic approaches, the latter through stochastic complexity. Experimental results illustrate the feasibility and merits of our approach on a difficult problem: clustering and recognizing visual concepts in different image data. The proposed approach is also successfully applied to text clustering.
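To make the idea of relevance weights on mixture-modeled discrete features concrete, the following is a minimal sketch and not the estimation procedure of this paper. It assumes a common feature-saliency formulation in which each feature is a single categorical variable, drawn with probability `rho[d]` from a cluster-specific distribution and otherwise from a background distribution shared by all clusters. The function name, argument names, and array shapes are illustrative assumptions.

```python
import numpy as np


def saliency_mixture_loglik(X, pi, theta, lam, rho):
    """Log-likelihood of categorical data under an assumed feature-saliency mixture.

    X     : (N, D) integer array, X[n, d] is the category of feature d in sample n
    pi    : (M,) mixing weights, summing to 1
    theta : (M, D, V) cluster-specific categorical parameters
    lam   : (D, V) background categorical parameters shared by all clusters
    rho   : (D,) relevance weight of each feature, in [0, 1]
    """
    X = np.asarray(X)
    pi = np.asarray(pi, dtype=float)
    theta = np.asarray(theta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    rho = np.asarray(rho, dtype=float)

    d_idx = np.arange(X.shape[1])
    # relevant[j, n, d] = theta[j, d, X[n, d]]
    relevant = theta[:, d_idx, X]
    # background[n, d] = lam[d, X[n, d]]
    background = lam[d_idx, X]

    # Each feature is a two-part mixture: cluster-specific vs. shared background.
    per_feature = rho * relevant + (1.0 - rho) * background  # shape (M, N, D)

    # Log of pi_j times the product over features, for every cluster and sample.
    log_comp = np.log(pi)[:, None] + np.log(per_feature).sum(axis=2)  # (M, N)

    # Sum over clusters in log space, then over samples.
    return float(np.logaddexp.reduce(log_comp, axis=0).sum())


if __name__ == "__main__":
    X = np.array([[0, 1], [1, 0]])
    pi = np.array([0.5, 0.5])
    theta = np.array([[[0.9, 0.1], [0.2, 0.8]],
                      [[0.1, 0.9], [0.7, 0.3]]])
    lam = np.array([[0.5, 0.5], [0.25, 0.75]])
    rho = np.array([0.8, 0.1])  # feature 0 mostly relevant, feature 1 mostly not
    print(saliency_mixture_loglik(X, pi, theta, lam, rho))
```

Under this assumed formulation, a weight near 1 means the feature helps discriminate between clusters, whereas a weight near 0 means the feature is explained by the shared background distribution and has little influence on the clustering.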
