Global and componentwise extrapolations for accelerating training of Bayesian networks and conditional random fields

The triple jump extrapolation method is an effective approximation of Aitken's acceleration that can speed up the convergence of many data mining algorithms, including EM and generalized iterative scaling (GIS). It has two variants: global and componentwise extrapolation. Previous empirical studies showed that neither variant dominates the other, and it was not known which one performs better under which conditions. In this paper, we investigate this problem and conclude that componentwise extrapolation is more effective when the Jacobian of the fixed-point mapping is (block) diagonal. We derive two hints for assessing block diagonality. The first is that when the data set is highly sparse, the Jacobian of the EM mapping for training a Bayesian network will be block diagonal. The second is that the block diagonality of the Jacobian of the GIS mapping for training a conditional random field (CRF) is negatively correlated with the strength of feature dependencies. We empirically verify these hints on controlled and real-world data sets and show that they accurately predict which variant will be superior. We also show that both global and componentwise extrapolation can provide substantial acceleration. In particular, when applied to train large-scale CRF models, the GIS variant accelerated by componentwise extrapolation not only outperforms its global extrapolation counterpart, as our hint predicts, but also competes with limited-memory BFGS (L-BFGS), the de facto standard for CRF training, in terms of both computational efficiency and F-scores. Although none of the above methods is as fast as stochastic gradient descent (SGD), SGD requires careful tuning, and the results reported in this paper provide a useful foundation for automating that tuning.
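To make the distinction between the two variants concrete, the following Python sketch applies an Aitken-style triple-jump extrapolation to a generic fixed-point mapping M, once with a single global rate estimate and once with one rate per parameter. The function, its arguments, and the toy diagonal mapping are illustrative assumptions for exposition, not the exact algorithm evaluated in the paper:

import numpy as np

def triple_jump(M, theta0, n_sweeps=5, mode="global", eps=1e-12):
    """Accelerate the fixed-point iteration theta <- M(theta) with an
    Aitken-style triple-jump extrapolation.  mode = "global" uses one
    scalar rate estimate; mode = "componentwise" uses one rate per
    parameter.  Schematic sketch only, not the authors' exact algorithm."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_sweeps):
        theta1 = M(theta)            # first basic (EM/GIS-like) step
        theta2 = M(theta1)           # second basic step
        d1, d2 = theta1 - theta, theta2 - theta1
        if mode == "global":
            # one convergence rate for all parameters, from the step-norm ratio
            gamma = min(np.linalg.norm(d2) / (np.linalg.norm(d1) + eps), 0.99)
            theta = theta1 + d2 / (1.0 - gamma)
        else:
            # one rate per component; most effective when the Jacobian of M
            # is (block) diagonal, i.e. parameters converge almost independently
            safe_d1 = np.where(np.abs(d1) > eps, d1, eps)
            gamma = np.clip(d2 / safe_d1, -0.99, 0.99)
            theta = theta1 + d2 / (1.0 - gamma)
    return theta

if __name__ == "__main__":
    # Toy linear mapping M(x) = A x + b with diagonal A (hypothetical example):
    # its Jacobian is diagonal, so componentwise extrapolation is exact here.
    A = np.diag([0.9, 0.5, 0.1])
    b = np.array([1.0, 2.0, 3.0])
    M = lambda x: A @ x + b
    exact = np.linalg.solve(np.eye(3) - A, b)
    for mode in ("global", "componentwise"):
        approx = triple_jump(M, np.zeros(3), n_sweeps=5, mode=mode)
        print(mode, "max abs error:", np.abs(approx - exact).max())

On this toy mapping the componentwise variant recovers the fixed point in a single sweep, while the global variant needs several, mirroring the paper's claim that componentwise extrapolation wins when the Jacobian is (block) diagonal.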
