CMDR: Classifying nodes for mining data records with different HTML structures

This paper addresses the problem of automated extraction of structured data records from web pages. In particular, we focus on extracting posts from online forum sites. We show that variability in the HTML structure of user-generated content within forum posts can reduce extraction accuracy, and we propose integrating a deep learning node classifier into the popular Mining Data Records (MDR) process proposed in prior work. Experiments on a dataset of forum web pages containing posts with varying HTML structures indicate the merits of the proposed modification to MDR.
