Language independent web news extraction system based on text detection framework

Web news provides a direct and efficient way to construct large text corpora. However, creating text data from web pages requires an understanding of HTML code and the preparation of customized parsing rules to identify the text content of a page. Typically, parsing rules are written manually and cannot be applied to pages with different layouts. In this study, we present a web news extraction system that is based on a text detection framework. The proposed method scans the input HTML page and summarizes its text statistics as a projection profile. Then, text block identification is applied to determine a set of content candidates. To filter noise, text verification determines whether a given text block should be included in the extracted content. We evaluate the proposed approach on the L3S-GN1 corpus and on 3506 multilingual news items randomly sampled from 325 websites (15 geographic regions and 11 distinct languages). We also compare the proposed method to 23 well-known state-of-the-art techniques. The experimental results show that the proposed method outperforms the second-best method (NReadability) by 7.30% in macro F-measure and is 16.91 times faster than NReadability. In terms of the perfect rate, the proposed method achieves 46.38% accuracy, whereas the Boilerpipe algorithm achieves only 21.54%. Because it requires no language-specific processing component, the proposed method is well suited to constructing multilingual corpora.
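The three stages named in the abstract (projection profile, text block identification, text verification) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-line character count used as the profile, the gap-tolerant grouping rule, the relative-size verification rule, and every function name and threshold (`min_chars`, `max_gap`, `keep_ratio`) are assumptions chosen only to make the idea concrete. The sketch also assumes that tags do not span multiple source lines.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")
# Script and style bodies are not visible text, so they are blanked first.
SKIP_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)


def projection_profile(html):
    """Return (lines, profile): visible text per source line and its length."""
    # Keep the newlines inside removed elements so line indices stay aligned.
    cleaned = SKIP_RE.sub(lambda m: "\n" * m.group(0).count("\n"), html)
    lines = [unescape(TAG_RE.sub("", ln)).strip() for ln in cleaned.split("\n")]
    return lines, [len(ln) for ln in lines]


def identify_blocks(profile, min_chars=1, max_gap=1):
    """Group text-bearing lines into (start, end) blocks.

    Lines separated by at most `max_gap` empty lines join the same block
    (assumed rule, hypothetical thresholds).
    """
    blocks = []
    start = last = None
    for i, n in enumerate(profile):
        if n < min_chars:
            continue
        if start is None:
            start = i
        elif i - last > max_gap + 1:
            blocks.append((start, last))
            start = i
        last = i
    if start is not None:
        blocks.append((start, last))
    return blocks


def verify_blocks(blocks, profile, keep_ratio=0.3):
    """Keep blocks whose text mass is at least `keep_ratio` of the largest one."""
    if not blocks:
        return []
    mass = [sum(profile[s:e + 1]) for s, e in blocks]
    top = max(mass)
    return [b for b, m in zip(blocks, mass) if m >= keep_ratio * top]


def extract_content(html):
    """Run the three stages and return the text of the verified blocks."""
    lines, profile = projection_profile(html)
    kept = verify_blocks(identify_blocks(profile), profile)
    return "\n".join(ln for s, e in kept for ln in lines[s:e + 1] if ln)
```

Nothing in this sketch inspects words or character sets, which reflects the language-independence claim: only the amount and position of text are used. The thresholds would need tuning on real pages, and the actual system may use quite different statistics.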
