Semi-supervised multi-task learning of structured prediction models for web information extraction

Extracting information from web pages is an important problem; it has several applications such as providing improved search results and construction of databases to serve user queries. In this paper we propose a novel structured prediction method to address two important aspects of the extraction problem: (1) labeled data is available only for a small number of sites and (2) a machine learned global model does not generalize adequately well across many websites. For this purpose, we propose a weight space based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the limited labeled data problem and falls in the class of graph regularization based semi-supervised learning approaches. Second, to address the generalization inadequacy of a global model, this method builds a local model for each website. Viewing the problem of building a local model for each website as a task, we learn the models for a collection of sites jointly; thus our method can also be seen as a graph regularization based multi-task learning approach. Learning the models jointly with the proposed method is very useful in two ways: (1) learning a local model for a website can be effectively influenced by labeled and unlabeled data from other websites; and (2) even for a website with only unlabeled examples it is possible to learn a decent local model. We demonstrate the efficacy of our method on several real-life data; experimental results show that significant performance improvement can be obtained by combining semi-supervised and multi-task learning in a single framework.

[1]  Jiawei Han,et al.  CETR: content extraction via tag ratios , 2010, WWW '10.

[2]  Jennifer Neville,et al.  Why collective inference improves relational classification , 2004, KDD.

[3]  Wei-Ying Ma,et al.  Simultaneous record detection and attribute labeling in web data extraction , 2006, KDD '06.

[4]  Dale Schuurmans,et al.  Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling , 2006, ACL.

[5]  Louise E. Moser,et al.  Extracting data records from the web using tag path clustering , 2009, WWW '09.

[6]  Marcos André Gonçalves,et al.  ONDUX: on-demand unsupervised learning for information extraction , 2010, SIGMOD Conference.

[7]  Khaled Shaalan,et al.  A Survey of Web Information Extraction Systems , 2006, IEEE Transactions on Knowledge and Data Engineering.

[8]  I. V. Ramakrishnan,et al.  Exploiting Structured Reference Data for Unsupervised Text Segmentation with Conditional Random Fields , 2008, SDM.

[9]  Nicholas Kushmerick,et al.  Wrapper Induction for Information Extraction , 1997, IJCAI.

[10]  Craig A. Knoblock,et al.  Hierarchical Wrapper Induction for Semistructured Information Sources , 2004, Autonomous Agents and Multi-Agent Systems.

[11]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[12]  Bing Liu,et al.  Web data extraction based on partial tree alignment , 2005, WWW '05.

[13]  Andrew McCallum,et al.  Accurate Information Extraction from Research Papers using Conditional Random Fields , 2004, NAACL.

[14]  Rahul Gupta,et al.  Answering Table Augmentation Queries from Unstructured Lists on the Web , 2009, Proc. VLDB Endow..

[15]  Fidel Cacheda,et al.  Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction , 2007, WISE.

[16]  Gholamreza Haffari,et al.  A Rate Distortion Approach for Semi-Supervised Conditional Random Fields , 2009, NIPS.

[17]  Massimiliano Pontil,et al.  Regularized multi--task learning , 2004, KDD.

[18]  Ivor W. Tsang,et al.  Domain adaptation from multiple sources via auxiliary classifiers , 2009, ICML '09.

[19]  Pierre Senellart,et al.  Automatic wrapper induction from hidden-web sources with domain knowledge , 2008, WIDM '08.

[20]  Tong Zhang,et al.  A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , 2005, J. Mach. Learn. Res..

[21]  Lawrence Carin,et al.  Semi-Supervised Multitask Learning , 2007, NIPS.

[22]  Lise Getoor,et al.  Link-Based Classification , 2003, Encyclopedia of Machine Learning and Data Mining.

[23]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[24]  Lorenzo Blanco,et al.  Redundancy-driven web data extraction and integration , 2010, WebDB '10.

[25]  Sunita Sarawagi,et al.  Automatic segmentation of text into structured records , 2001, SIGMOD '01.

[26]  Sunita Sarawagi,et al.  Information Extraction , 2008 .

[27]  Shui-Lung Chuang,et al.  Context-Aware Wrapping: Synchronized Data Extraction , 2007, VLDB.

[28]  Dimitris Samaras,et al.  Multi-Task Learning of Gaussian Graphical Models , 2010, ICML.

[29]  Eugene Agichtein,et al.  Mining reference tables for automatic text segmentation , 2004, KDD.

[30]  Craig A. Knoblock,et al.  Unsupervised information extraction from unstructured, ungrammatical data sources on the World Wide Web , 2007, International Journal of Document Analysis and Recognition (IJDAR).

[31]  Mikhail Belkin,et al.  Maximum Margin Semi-Supervised Learning for Structured Variables , 2005, NIPS 2005.

[32]  Rich Caruana,et al.  Multitask Learning , 1997, Machine Learning.

[33]  Jun Suzuki,et al.  Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data , 2008, ACL.

[34]  Lise Getoor,et al.  Collective Classification in Network Data , 2008, AI Mag..

[35]  Sunita Sarawagi,et al.  Integrating Unstructured Data into Relational Databases , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[36]  Valter Crescenzi,et al.  RoadRunner: Towards Automatic Data Extraction from Large Web Sites , 2001, VLDB.

[37]  Ivor W. Tsang,et al.  Extracting discriminative concepts for domain adaptation in text mining , 2009, KDD.

[38]  Charles A. Micchelli,et al.  Learning Multiple Tasks with Kernel Methods , 2005, J. Mach. Learn. Res..

[39]  Sunita Sarawagi,et al.  Domain Adaptation of Conditional Probability Models Via Feature Subsetting , 2007, PKDD.

[40]  Slav Petrov,et al.  Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models , 2010, EMNLP.

[41]  Pedro M. Domingos,et al.  Joint Inference in Information Extraction , 2007, AAAI.