CETR: content extraction via tag ratios

We present Content Extraction via Tag Ratios (CETR), a method that extracts content text from diverse webpages using the tag ratios of the HTML document. We describe how to compute tag ratios line by line and then cluster the resulting histogram into content and non-content areas. We first find that the tag ratio histogram is not easily clustered because it is one-dimensional; we therefore extend the original approach to model the data in two dimensions. Next, we present a tailored clustering technique that operates on the two-dimensional model, and we evaluate our approach against a large set of alternative methods using standard accuracy, precision and recall metrics on a large and varied Web corpus. Finally, we show that, in most cases, CETR achieves better content extraction performance than existing methods, especially across varying web domains, languages and styles.
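
The line-by-line tag ratio computation mentioned above can be illustrated with a minimal sketch. The abstract does not define the ratio precisely, so this example assumes it is the number of non-tag characters on a line divided by the number of HTML tags on that line, with a tag count of zero treated as one to avoid division by zero. The function name `tag_ratios`, the regular expression used to find tags, and the sample page are all hypothetical and chosen only for illustration; the two-dimensional model and the clustering step are not shown.

```python
import re

# Assumption: anything between '<' and '>' counts as one tag. A real
# implementation would likely use an HTML parser and first strip scripts,
# styles, and comments.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the input document."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        # Characters left on the line once the tags are removed.
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumption: a line without tags is treated as having one tag.
        ratios.append(text_chars / max(tag_count, 1))
    return ratios


# Hypothetical three-line page: navigation, article text, footer.
page = """<div id="nav"><a href="/">Home</a><a href="/about">About</a></div>
<p>Tag ratios tend to be high on lines that carry article text.</p>
<div class="footer"><span>Legal</span></div>"""

ratios = tag_ratios(page)
for number, ratio in enumerate(ratios, start=1):
    print(f"line {number}: ratio = {ratio:.2f}")
```

Under these assumptions the article line receives a much higher ratio than the navigation and footer lines, which is the kind of signal a later clustering step would use to separate content from non-content areas.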
