Heterogeneous Web Data Extraction Algorithm Based On Modified Hidden Conditional Random Fields

Extracting useful information from heterogeneous Web data is of great importance. In this paper, we propose a novel heterogeneous Web data extraction algorithm based on a modified hidden conditional random fields model. Because traditional linear-chain conditional random fields cannot effectively handle complex, heterogeneous Web data extraction, we modify the standard hidden conditional random fields in three respects. (1) A hidden Markov model is used to calculate the hidden variables. (2) Training proceeds in two stages: in the first stage, each training sequence is learned with a hidden Markov model, so that the implicit variables become visible; in the second stage, the parameters are learned for a given sequence. (3) The objective functions of the hidden conditional random fields are revised, and heterogeneous Web data are extracted by maximizing the posterior probability of the modified model. Finally, we evaluate performance experimentally on two standard datasets, the "EData" dataset and the "Research Papers" dataset. Compared with existing Web data extraction methods, the proposed algorithm extracts useful information from heterogeneous Web data effectively and efficiently.
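The first stage described above, in which a hidden Markov model makes the hidden variables visible, can be illustrated with a minimal sketch. It assumes that "making the implicit variables visible" means decoding the most likely hidden-state sequence with the Viterbi algorithm; the abstract does not say which decoding procedure is used. The state names, observation symbols, and probabilities below are hypothetical toy values, not taken from the paper.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor state that maximizes the path score into s
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy model: page tokens are either field labels or field values,
# observed only through their rendering style (assumption for illustration).
STATES = ["label", "value"]
START_P = {"label": 0.6, "value": 0.4}
TRANS_P = {
    "label": {"label": 0.3, "value": 0.7},
    "value": {"label": 0.4, "value": 0.6},
}
EMIT_P = {
    "label": {"bold": 0.8, "plain": 0.2},
    "value": {"bold": 0.1, "plain": 0.9},
}

hidden = viterbi(["bold", "plain", "plain"], STATES, START_P, TRANS_P, EMIT_P)
print(hidden)  # ['label', 'value', 'value']
```

In a two-stage scheme of this kind, the decoded states would then be treated as observed variables when the conditional random field parameters are learned in the second stage; that step is not shown here.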
