Positive vectors clustering using inverted Dirichlet finite mixture models

In this work, we present an unsupervised algorithm for learning finite mixture models from multivariate positive data. This kind of data appears naturally in many applications, yet it has not been adequately addressed in the past. The mixture model is based on the inverted Dirichlet distribution, which offers a good representation and modeling of positive non-Gaussian data. The parameters of the inverted Dirichlet mixture are estimated by maximum likelihood (ML) using the Newton-Raphson method. We also develop an approach, based on the minimum message length (MML) criterion, to select the optimal number of clusters to represent the data using such a mixture. Experimental results are presented using artificial histograms and real data sets. The challenging problem of software module classification is also investigated within the proposed statistical framework.
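
To make the model concrete, the following minimal sketch evaluates the log-likelihood that a maximum likelihood procedure would maximize. It assumes the standard form of the inverted Dirichlet density for a positive vector x of dimension D with D + 1 positive parameters: p(x | alpha) = Gamma(|alpha|) / prod_d Gamma(alpha_d) * prod_{d<=D} x_d^(alpha_d - 1) * (1 + sum_d x_d)^(-|alpha|), where |alpha| is the sum of all D + 1 parameters. The function names, argument layout, and the log-sum-exp step are illustrative choices, not taken from the paper, and the Newton-Raphson updates and the MML criterion are not shown.

```python
import math


def inverted_dirichlet_logpdf(x, alpha):
    """Log-density of the inverted Dirichlet at a positive vector x.

    x has D entries; alpha has D + 1 strictly positive entries.
    (Hypothetical helper, written for illustration only.)
    """
    if len(alpha) != len(x) + 1:
        raise ValueError("alpha must have exactly one more entry than x")
    if any(v <= 0 for v in x) or any(a <= 0 for a in alpha):
        raise ValueError("x and alpha must be strictly positive")
    total = sum(alpha)
    # Log of the normalizing constant Gamma(|alpha|) / prod Gamma(alpha_d).
    log_norm = math.lgamma(total) - sum(math.lgamma(a) for a in alpha)
    # zip stops at len(x), so only alpha_1..alpha_D are paired with x_1..x_D.
    log_kernel = sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))
    return log_norm + log_kernel - total * math.log1p(sum(x))


def mixture_loglik(data, weights, alphas):
    """Log-likelihood of the data under a finite inverted Dirichlet mixture.

    weights are the mixing proportions (positive, summing to 1);
    alphas holds one parameter vector per component.
    """
    total = 0.0
    for x in data:
        terms = [
            math.log(w) + inverted_dirichlet_logpdf(x, a)
            for w, a in zip(weights, alphas)
        ]
        # Log-sum-exp over components for numerical stability.
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

As a sanity check, with D = 1 and alpha = (1, 1) the density reduces to 1 / (1 + x)^2, so `inverted_dirichlet_logpdf([1.0], [1.0, 1.0])` equals log(0.25). Comparing `mixture_loglik` across candidate numbers of components is not a model selection rule on its own, since the likelihood does not penalize extra components; that is the role a criterion such as MML plays.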
