Paper information - CETR: content extraction via tag ratios

CETR: content extraction via tag ratios

We present Content Extraction via Tag Ratios (CETR) - a method to extract content text from diverse webpages by using the HTML document's tag ratios. We describe how to compute tag ratios on a line-by-line basis and then cluster the resulting histogram into content and non-content areas. Initially, we find that the tag ratio histogram is not easily clustered because of its one-dimensionality; therefore we extend the original approach in order to model the data in two dimensions. Next, we present a tailored clustering technique which operates on the two-dimensional model, and then evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
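The line-by-line tag ratio computation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the ratio for a line is the number of non-tag characters divided by the number of tags on that line, and that a line with no tags keeps its raw character count. The function name `tag_ratios`, the regular expression, and the sample HTML are all hypothetical, and any preprocessing (such as removing scripts or comments) as well as the two-dimensional model and clustering step are left out.

```python
import re

# Naive tag matcher: anything between '<' and '>' is treated as one tag.
# Assumption for illustration; a real HTML parser would be more robust.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the input HTML."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Characters left after removing tags and surrounding whitespace.
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumed convention: with no tags, use the character count itself.
        ratios.append(text_chars / tag_count if tag_count else float(text_chars))
    return ratios


# Hypothetical three-line page: navigation, article text, share widget.
sample = """<div><a href="/">Home</a> <a href="/news">News</a></div>
<p>CETR separates article text from navigation clutter.</p>
<div><span>Share</span></div>"""

ratios = tag_ratios(sample)
for line_number, ratio in enumerate(ratios, start=1):
    print(line_number, ratio)
```

In this toy input the article line has a much higher ratio than the navigation and widget lines, which is the signal the abstract says is then clustered into content and non-content areas.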
