An Introduction to Conditional Random Fields

Many tasks involve predicting a large number of variables that depend on each other as well as on other observed variables. Structured prediction methods are essentially a combination of classification and graphical modeling. They combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. This survey describes conditional random fields (CRFs), a popular probabilistic method for structured prediction. CRFs have seen wide application in many areas, including natural language processing, computer vision, and bioinformatics. We describe methods for inference and parameter estimation for CRFs, including practical issues for implementing large-scale CRFs. We do not assume previous knowledge of graphical modeling, so this survey is intended to be useful to practitioners in a wide variety of fields.
