Optimal Kullback–Leibler Aggregation via Information Bottleneck

In this paper, we present a method for reducing a regular, discrete-time Markov chain (DTMC) to another DTMC with a given, typically much smaller, number of states. The cost of reduction is defined as the Kullback–Leibler divergence rate between a projection of the original process through a partition function and a DTMC on the correspondingly partitioned state space. Finding the reduced model with minimal cost is computationally expensive, as it requires an exhaustive search over all state space partitions and an exact evaluation of the reduction cost for each candidate partition. Our approach addresses the latter problem by minimizing an upper bound on the reduction cost instead of the exact cost. The proposed upper bound is easy to compute, and it is tight if the original chain is lumpable with respect to the partition. To address the former problem, we express the minimization as an information bottleneck optimization and propose using the agglomerative information bottleneck algorithm to search for a suboptimal partition greedily rather than exhaustively. The theory is illustrated with examples and one application scenario in the context of modeling bio-molecular interactions.
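As a rough illustration of the greedy search described above, the following Python sketch merges states of a small chain agglomeratively and then builds a reduced chain on the resulting partition. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not spell out the information bottleneck objective, so the code assumes that each merge should keep the mutual information between the block of the current state and the next state as large as possible, and it assumes the usual stationary-weighted averaging for the reduced transition matrix. All function names and the example matrix `P` are hypothetical.

```python
import math
from itertools import combinations


def stationary(P, iters=1000):
    """Approximate the stationary distribution of a regular chain by power iteration."""
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (px[a] * py[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )


def agglomerate(P, m):
    """Greedily merge blocks until m remain (assumed objective: keep I(block(X1); X2) large)."""
    n = len(P)
    mu = stationary(P)
    blocks = [[i] for i in range(n)]
    # rows[k][j] = Pr(X1 in block k, X2 = j) under the stationary chain
    rows = [[mu[i] * P[i][j] for j in range(n)] for i in range(n)]
    while len(blocks) > m:
        best = None
        for a, b in combinations(range(len(blocks)), 2):
            merged_row = [x + y for x, y in zip(rows[a], rows[b])]
            candidate = [r for k, r in enumerate(rows) if k not in (a, b)]
            candidate.append(merged_row)
            score = mutual_information(candidate)
            if best is None or score > best[0]:
                best = (score, a, b, candidate)
        _, a, b, candidate = best
        merged_block = blocks[a] + blocks[b]
        blocks = [blk for k, blk in enumerate(blocks) if k not in (a, b)]
        blocks.append(merged_block)
        rows = candidate
    return sorted(sorted(blk) for blk in blocks), mu


def aggregate(P, mu, blocks):
    """Reduced chain on the blocks, assuming stationary-weighted averaging of rows."""
    Q = []
    for K in blocks:
        weight = sum(mu[i] for i in K)
        Q.append([sum(mu[i] * P[i][j] for i in K for j in L) / weight for L in blocks])
    return Q


# Hypothetical 4-state chain with two weakly coupled groups, {0, 1} and {2, 3}.
P = [
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]

blocks, mu = agglomerate(P, 2)
Q = aggregate(P, mu, blocks)
print(blocks)
print(Q)
```

The example chain is lumpable with respect to the partition into {0, 1} and {2, 3}, which is the case in which the abstract states the upper bound is tight. Each greedy step here evaluates every pair of blocks, so the number of candidate evaluations per step is quadratic in the number of blocks, in contrast to the exhaustive search over all partitions.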
