Unsupervised learning of finite mixture models with deterministic annealing for large-scale data analysis

The finite mixture model is one of the most fundamental tools in data mining and machine learning for uncovering the essential structure of observed random sample data. It builds a probabilistic model in which the data are described by a probability distribution represented as a mixture of other distributions, called latent components. The finite mixture model thus provides a convenient way to explain the random phenomena in observed sample data as a generative process over a finite number of mixture components. The main challenges in fitting a finite mixture model are (i) searching for an optimal set of model parameters in a large problem space and (ii) finding a model that generalizes well and avoids overfitting. The standard method for fitting a finite mixture model is the Expectation-Maximization (EM) algorithm. However, an EM-based algorithm finds only locally optimal solutions, so the quality of the result depends heavily on the initial conditions (the local optimum problem). Moreover, it can overfit. We address these problems with an optimization heuristic known as Deterministic Annealing (DA), which has proven successful at avoiding local optima and has been widely used in data mining algorithms. More specifically, in this thesis we focus on two well-known data mining algorithms based on the finite mixture model: Generative Topographic Mapping (GTM) for dimension reduction and data visualization, and Probabilistic Latent Semantic Analysis (PLSA) for text mining and information retrieval. These two algorithms have been widely used in data visualization and text mining, but they still suffer from the local optimum problem because their original formulations rely on the EM algorithm. We extend both algorithms with DA to improve the quality of their parameter estimates and to help avoid overfitting. We present experimental results that demonstrate the improvements.
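To make the idea of annealing an EM fit concrete, the following is a minimal sketch and not the algorithm developed in the thesis. It assumes a one-dimensional Gaussian mixture with a fixed, shared variance, and it tempers the E-step by raising each component's weighted density to a power `beta` (an inverse temperature) that is increased in stages up to 1, where the update coincides with ordinary EM. The function names, the `betas` schedule, and the jitter size are illustrative assumptions.

```python
import math
import random


def tempered_e_step(data, weights, means, var, beta):
    """Posterior of each component, proportional to (weight * density) ** beta."""
    resp = []
    for x in data:
        # Log of weight * Gaussian density, up to a constant shared by all
        # components (the variance is the same for every component).
        scores = [beta * (math.log(w) - 0.5 * (x - m) ** 2 / var)
                  for w, m in zip(weights, means)]
        top = max(scores)  # subtract the maximum for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp


def m_step(data, resp, k):
    """Re-estimate mixing weights and means from the soft assignments."""
    n = len(data)
    weights, means = [], []
    for j in range(k):
        mass = sum(r[j] for r in resp)
        weights.append(mass / n)
        means.append(sum(r[j] * x for r, x in zip(resp, data)) / mass)
    return weights, means


def fit_da_em(data, k, var=1.0, betas=(0.01, 0.03, 0.1, 0.3, 1.0),
              iters_per_beta=50, seed=0):
    """Annealed EM for a 1-D Gaussian mixture with fixed shared variance."""
    rng = random.Random(seed)
    center = sum(data) / len(data)
    weights = [1.0 / k] * k
    means = [center] * k  # every component starts at the data mean
    for beta in betas:  # beta = 1 / temperature, raised in stages to 1
        # A small jitter lets components separate once the merged solution
        # becomes unstable at the current temperature.
        means = [m + 1e-3 * rng.gauss(0.0, 1.0) for m in means]
        for _ in range(iters_per_beta):
            resp = tempered_e_step(data, weights, means, var, beta)
            weights, means = m_step(data, resp, k)
    return weights, means


if __name__ == "__main__":
    offsets = [i / 10 - 1 for i in range(21)]
    data = [-5 + o for o in offsets] + [5 + o for o in offsets]
    weights, means = fit_da_em(data, k=2)
    print(sorted(means), weights)
```

In this sketch all components start at the same location, so the outcome does not hinge on a hand-picked starting position for each component; at small `beta` the soft assignments are nearly uniform and the components stay merged, and they separate only as `beta` grows. On two well-separated clusters such as the ones in the usage example, the estimated means should end up near the cluster centers.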
