Entropy-based automated wrapper generation for weblog data extraction

This paper proposes a fully automated information extraction methodology for weblogs. The methodology combines the use of web feeds with the processing of HTML to extract weblog properties. It includes a model for generating a wrapper that uses web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach to derive the rules and automate wrapper generation. An evaluation of the model on a collection of weblogs reports a prediction accuracy of 89%. The results of this evaluation show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere.
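
The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the rule format (`tag.class`), the names `ElementCollector` and `derive_rule`, the sample posts, and the simple frequency-based score are all assumptions made for illustration, and the score stands in for whatever probabilistic or entropy-based measure the paper actually uses. The idea shown is only that a feed value (here, a post title) is compared with the text of HTML elements across several posts, and the rule that most reliably selects the matching element is kept.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (rule, text) pairs; a rule is 'tag.class' of the innermost open element."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, rule) for every currently open element
        self.found = []  # (rule, text) pairs seen so far

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        self.stack.append((tag, f"{tag}.{cls}" if cls else tag))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack:
            self.found.append((self.stack[-1][1], text))


def derive_rule(posts):
    """posts: list of (feed_value, html). Returns (best_rule, score) or (None, 0.0)."""
    matched = Counter()   # rule -> elements whose text equals the feed value
    selected = Counter()  # rule -> all elements the rule selects
    for feed_value, html in posts:
        collector = ElementCollector()
        collector.feed(html)
        collector.close()
        target = " ".join(feed_value.split()).lower()
        for rule, text in collector.found:
            selected[rule] += 1
            if text.lower() == target:
                matched[rule] += 1
    if not matched:
        return None, 0.0
    # Assumed score: fraction of selected elements that match the feed value,
    # with the raw match count as a tie-breaker.
    best = max(matched, key=lambda r: (matched[r] / selected[r], matched[r]))
    return best, matched[best] / selected[best]


# Hypothetical sample data: feed titles paired with the HTML of each post.
example_posts = [
    ("First post",
     '<div><h1 class="site">My Blog</h1><h2 class="entry-title">First post</h2>'
     '<p>First post</p><p>Body text</p></div>'),
    ("Second post",
     '<div><h1 class="site">My Blog</h1><h2 class="entry-title">Second post</h2>'
     '<p>More body text</p></div>'),
]

best_rule, score = derive_rule(example_posts)
```

In this sample, `p` matches the title once but selects three elements in total, while `h2.entry-title` matches in both posts and selects nothing else, so the latter is chosen. Aggregating over multiple posts is what lets an incidental match in a single post be outweighed.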
