Entropy-based automated wrapper generation for weblog data extraction

This paper proposes a fully automated information extraction methodology for weblogs. The methodology combines the use of web feeds with the processing of HTML to extract weblog properties. At its core is a model that generates a wrapper by exploiting web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values found in the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to deriving the rules, which automates the process of wrapper generation. The model is evaluated on a collection of weblogs and achieves a prediction accuracy of 89%. The results of this evaluation indicate that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
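The matching step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the element paths, the exact-match comparison, the frequency-based score, and the function names (`candidate_paths`, `score_rules`, `entropy`) are all assumptions made for illustration. Each post is assumed to be already parsed into (path, text) pairs, and the feed supplies the known value of one property, here the title.

```python
from collections import Counter
from math import log2


def normalize(text):
    """Collapse whitespace and lowercase so trivial differences do not block a match."""
    return " ".join(text.split()).lower()


def candidate_paths(elements, feed_value):
    """Return the paths of the elements whose text equals the feed value."""
    target = normalize(feed_value)
    return [path for path, text in elements if normalize(text) == target]


def score_rules(posts):
    """Score each path by the fraction of posts in which it holds the feed value.

    posts is a list of (elements, feed_value) pairs (hypothetical input shape).
    """
    counts = Counter()
    for elements, feed_value in posts:
        counts.update(set(candidate_paths(elements, feed_value)))
    return {path: n / len(posts) for path, n in counts.items()}


def entropy(scores):
    """Shannon entropy (bits) of the normalized scores; lower means less ambiguity."""
    total = sum(scores.values())
    return -sum((s / total) * log2(s / total) for s in scores.values() if s > 0)


# Toy data: three posts, each with the title value taken from the feed.
posts = [
    ([("html/head/title", "First post"),
      ("div.post/h1", "First post"),
      ("div.post/p", "Hello")], "First post"),
    ([("html/head/title", "My blog - Second"),
      ("div.post/h1", "Second post"),
      ("div.post/p", "More")], "Second post"),
    ([("html/head/title", "Third post"),
      ("div.post/h1", "Third post")], "Third post"),
]

scores = score_rules(posts)
best = max(scores, key=scores.get)  # candidate extraction rule for the title
```

In this toy data `div.post/h1` matches the feed title in every post, while `html/head/title` matches in only two of the three, so the former is selected as the rule. The `entropy` helper is one possible way to quantify how ambiguous the candidate set is; the abstract does not specify the measure used.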
