A maximum likelihood approximation method for Dirichlet's parameter estimation

Dirichlet distributions are natural choices to analyse data described by frequencies or proportions since they are the simplest known distributions for such data apart from the uniform distribution. They are often used whenever proportions are involved, for example in text mining, image analysis and biology, or as a prior of a multinomial distribution in Bayesian statistics. As the Dirichlet distribution belongs to the exponential family, its parameters can be easily inferred by maximum likelihood. Parameter estimation is usually performed with the Newton-Raphson algorithm after an initialisation step using either the method of moments or Ronning's method. However, this initialisation can result in parameters that lie outside the admissible region. A simple and very efficient alternative based on a maximum likelihood approximation is presented. The advantages of the presented method compared to the two other methods are demonstrated on synthetic data sets as well as on a practical biological problem: the clustering of protein sequences based on their amino acid compositions.
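To make the initialisation step concrete, the following is a minimal sketch of the standard method-of-moments estimator for Dirichlet parameters, which is one of the two initialisations mentioned above. It is not the maximum likelihood approximation proposed here. The function name `dirichlet_moment_init`, the choice of the first component to estimate the precision, and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np


def dirichlet_moment_init(x):
    """Method-of-moments estimate of Dirichlet parameters (illustrative sketch).

    x: array of shape (n_samples, k); each row is a vector of proportions
       that sums to 1.
    """
    x = np.asarray(x, dtype=float)
    m1 = x.mean(axis=0)            # first moments E[X_i] = alpha_i / alpha_0
    m2 = (x[:, 0] ** 2).mean()     # second moment of the first component
    # Precision alpha_0 = sum(alpha), derived from
    # Var[X_1] = E[X_1] (1 - E[X_1]) / (alpha_0 + 1).
    alpha0 = (m1[0] - m2) / (m2 - m1[0] ** 2)
    return m1 * alpha0


# Synthetic check with hypothetical parameters.
rng = np.random.default_rng(0)
true_alpha = np.array([2.0, 5.0, 3.0])
sample = rng.dirichlet(true_alpha, size=5000)
alpha_hat = dirichlet_moment_init(sample)

# Admissible Dirichlet parameters must all be strictly positive.
is_admissible = bool(np.all(alpha_hat > 0))
```

An estimate of this kind would typically serve as the starting point for Newton-Raphson iterations on the log-likelihood. Checking `is_admissible` before and after each update is one simple way to detect parameters that have left the admissible region.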
