Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies: they perform data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm based on hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution over both model structure and class labels. This joint distribution belongs to the exponential family. As a conditional model, DHMRFs relax the independence assumptions made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements in both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction, DHMRFs can potentially address the blocky artifact issue from which fixed-structure hierarchical models suffer.
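
As a rough illustration of what a conditional exponential-family joint distribution over structure and labels can look like, the following sketch uses notation that is assumed here and not taken from the paper: $x$ denotes the observed page, $s$ a candidate model structure, $y$ the class labels, $f_k$ generic feature functions, and $\lambda_k$ their weights.

```latex
% Illustrative notation only; the symbols x, s, y, f_k, \lambda_k are assumptions.
\begin{align}
  p(s, y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_k \lambda_k f_k(s, y, x) \Big), \\
  Z(x) &= \sum_{s'} \sum_{y'} \exp\Big( \sum_k \lambda_k f_k(s', y', x) \Big), \\
  (s^{*}, y^{*}) &= \arg\max_{s,\, y} \; p(s, y \mid x).
\end{align}
```

Under this reading, the normalizer $Z(x)$ sums over every structure and every label assignment, which is consistent with exact inference being intractable and with an approximate (variational) method being used for both learning and MAP estimation.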
