Inducing Features of Random Fields

We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field, and an iterative scaling algorithm estimates the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classification in natural language processing.
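To make the weight-estimation step concrete, the following is a minimal sketch (not the authors' implementation) of fitting the weights of a log-linear field p(x) ∝ exp(Σ_i w_i f_i(x)) on a toy sample space with Generalized Iterative Scaling. The sample space, the two binary features, and the empirical distribution are hypothetical and chosen only for illustration; the greedy feature-induction step, which at each stage adds the candidate feature whose optimal single-weight adjustment most reduces the KL divergence, is not shown.

```python
import math
from itertools import product

# Tiny illustrative sample space: all binary strings of length 3.
OMEGA = list(product([0, 1], repeat=3))

# Hypothetical binary features, each supported on a small subset of positions.
base_features = [
    lambda x: float(x[0] == 1),                 # first bit is on
    lambda x: float(x[1] == 1 and x[2] == 1),   # last two bits are both on
]

# Illustrative "empirical" distribution (in practice, the relative
# frequencies of the training samples).
counts = {x: 1.0 for x in OMEGA}
counts[(1, 1, 1)] += 2.0
total = sum(counts.values())
empirical = {x: c / total for x, c in counts.items()}

# GIS needs the features to sum to a constant C on every configuration,
# so append the usual slack ("correction") feature.
C = max(sum(f(x) for f in base_features) for x in OMEGA) + 1.0
features = base_features + [lambda x: C - sum(f(x) for f in base_features)]

def model(weights):
    """Normalized field p(x) proportional to exp(sum_i w_i f_i(x))."""
    unnorm = {x: math.exp(sum(w * f(x) for w, f in zip(weights, features)))
              for x in OMEGA}
    Z = sum(unnorm.values())
    return {x: v / Z for x, v in unnorm.items()}

def expectation(dist, f):
    """Expected value of feature f under distribution dist."""
    return sum(p * f(x) for x, p in dist.items())

target = [expectation(empirical, f) for f in features]
weights = [0.0] * len(features)

# GIS update: w_i <- w_i + (1/C) * log(E_empirical[f_i] / E_model[f_i]).
for _ in range(500):
    p = model(weights)
    weights = [w + math.log(target[i] / expectation(p, features[i])) / C
               for i, w in enumerate(weights)]

p = model(weights)
for i in range(len(base_features)):
    print(f"feature {i}: empirical={target[i]:.4f}  model={expectation(p, features[i]):.4f}")
```

Running the sketch drives the model expectations of both features toward their empirical values, which is the fixed point characterizing the minimum-divergence field for a fixed feature set.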
