Web content extraction based on maximum continuous sum of text density

Different websites use different page structures, which heavily affects extraction quality when web content is collected automatically. Based on a statistical analysis of the content features and structural characteristics of news-domain web pages, this paper proposes a maximum continuous sum of text density (MCSTD) method to extract web content from different web pages efficiently and effectively. First, web pages are preprocessed; then the text density of the page text is calculated; finally, the web content is extracted using the proposed MCSTD method. Experimental results show that the extraction precision exceeds 95%, and that the proposed approach is more efficient and easier to implement than traditional models. The method has also been applied to comparable corpora construction using the extracted web resources.
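
The three steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's definition of text density or its preprocessing rules, so everything below is an assumption made for illustration: density is taken as the fraction of a line's characters that lie outside tags, a hypothetical `threshold` is subtracted so that tag-heavy lines score negative, and the "maximum continuous sum" is found with the standard maximum-subarray (Kadane) scan. The names `text_density`, `max_continuous_sum`, and `extract_content` are hypothetical and are not taken from the paper.

```python
import re
from typing import List, Tuple

# Crude tag matcher, enough for a sketch; a real system would use an HTML parser.
TAG_RE = re.compile(r"<[^>]*>")


def text_density(line: str) -> float:
    """Assumed density: share of a line's characters that are not inside tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def max_continuous_sum(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end_exclusive, best_sum) of the contiguous run of
    scores with the largest sum (Kadane's algorithm)."""
    best_sum = float("-inf")
    best_start = best_end = 0
    cur_sum = 0.0
    cur_start = 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum = score
            cur_start = i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1
    return best_start, best_end, best_sum


def extract_content(html: str, threshold: float = 0.5) -> str:
    """Score each line as density minus an assumed threshold, then keep the
    text of the contiguous block of lines whose scores sum highest."""
    lines = html.splitlines()
    scores = [text_density(line) - threshold for line in lines]
    start, end, _ = max_continuous_sum(scores)
    return "\n".join(TAG_RE.sub("", line).strip() for line in lines[start:end])


if __name__ == "__main__":
    page = "\n".join([
        '<div><a href="/home">Home</a></div>',
        "<p>The first paragraph of the article has plenty of plain text in it.</p>",
        "<p>The second paragraph continues the story with more words.</p>",
        '<div><a href="/about">About</a><a href="/contact">Contact</a></div>',
    ])
    print(extract_content(page))
```

On this toy page the navigation lines score below the threshold and the two paragraph lines score above it, so the scan selects the paragraph block. The threshold and the line-level granularity are tuning choices in this sketch, not values reported by the authors.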
