Extracting Weighted Automata for Approximate Minimization in Language Modelling

In this paper we study the approximate minimization problem for language modelling. We assume we are given a language model as a black box. The objective is to extract a weighted finite automaton (WFA) that satisfies a given size constraint and mimics the behaviour of the original model, minimizing some notion of distance between the black box and the extracted WFA. We provide an algorithm for the approximate minimization of black boxes trained for language modelling of sequential data over a one-letter alphabet. By reformulating the problem in terms of Hankel matrices, we leverage classical results on the approximation of Hankel operators, namely the celebrated Adamyan-Arov-Krein (AAK) theory. This allows us to use the spectral norm to measure the distance between the black box and the WFA. We provide theoretical guarantees to study the potentially infinite-rank Hankel matrix of the black box, without accessing the training data, and we prove that our method returns an asymptotically-optimal approximation.
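To make the Hankel reformulation concrete, the following is a minimal sketch and not the paper's algorithm. Over a one-letter alphabet a string is determined by its length, so a model is a function f(n), and its Hankel matrix has entries H[i, j] = f(i + j). The names `hankel_block` and `toy_black_box` are hypothetical, and the toy function is an assumed stand-in for a black box (it is realizable by a 2-state WFA, so its Hankel matrix has rank 2). The sketch only builds a finite block of the matrix and inspects its singular values; it does not implement the AAK construction.

```python
import numpy as np


def hankel_block(f, size):
    """Top-left size x size block of the Hankel matrix H[i, j] = f(i + j)."""
    # Only 2*size - 1 distinct values are needed: f(0), ..., f(2*size - 2).
    values = np.array([f(n) for n in range(2 * size - 1)], dtype=float)
    idx = np.arange(size)
    return values[idx[:, None] + idx[None, :]]


def toy_black_box(n):
    # Hypothetical stand-in for a black-box model over a one-letter alphabet:
    # a sum of two geometric sequences, which a 2-state WFA can compute.
    return 0.5 * 0.5 ** n + 0.3 * (-0.4) ** n


H = hankel_block(toy_black_box, 20)

# Singular values in decreasing order; the largest is the spectral norm.
singular_values = np.linalg.svd(H, compute_uv=False)
spectral_norm = singular_values[0]

# Numerical rank of the block (assumed relative tolerance of 1e-10).
numerical_rank = int(np.sum(singular_values > 1e-10 * spectral_norm))

# By the Eckart-Young theorem, no rank-1 matrix can approximate this block
# with spectral-norm error smaller than the second singular value.
rank1_error_lower_bound = singular_values[1]

print("numerical rank:", numerical_rank)
print("spectral norm:", spectral_norm)
print("lower bound on rank-1 error:", rank1_error_lower_bound)
```

A plain truncated SVD attains this lower bound on the finite block, but its result is generally not a Hankel matrix and so does not correspond to a WFA. That gap is the reason the abstract appeals to AAK theory, which concerns approximation by Hankel operators.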
