Extracting news text from web pages: an application for the visually impaired

Apart from the actual content, web pages contain several other components (referred to as boilerplate text) that describe how and in what context the content should be displayed. We show how content-bearing text can be efficiently separated from boilerplate text using a random forest classifier. We compare its performance with another state-of-the-art method for boilerplate detection, which uses a decision tree classifier and shallow features extracted from the text. The random forest classifier performs better on both classification problems analyzed, and significantly better on the more complex one. We also show that a small extension of the feature set can improve accuracy further. We conclude that random forest classification can achieve significantly higher accuracy than at least one of the current state-of-the-art methods for content extraction. These results can improve content extraction methods for a variety of applications, including search engine optimization and making the web more accessible for blind or visually impaired users.
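To make the approach concrete, the following is a minimal sketch of a random forest (bootstrap samples plus a random feature choice at each split, with majority voting) applied to two shallow features per text block. It is not the authors' implementation: the feature choice (word count and link density), the function names, the hyperparameters, and the toy training blocks are all assumptions made up for illustration.

```python
import random
from collections import Counter

def shallow_features(text, linked_words):
    """Hypothetical shallow features for a text block: [word count, link density]."""
    n = len(text.split())
    return [float(n), (linked_words / n) if n else 0.0]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, n_try, rng, depth=0, max_depth=3):
    """Grow one tree; each split considers only n_try randomly chosen features."""
    if depth == max_depth or len(set(labels)) == 1:
        return majority(labels)  # leaf: a class label
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return majority(labels)
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       n_try, rng, depth + 1, max_depth),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       n_try, rng, depth + 1, max_depth))

def predict_tree(node, row):
    """Internal nodes are (feature, threshold, left, right) tuples; leaves are labels."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(rows, labels, n_trees=25, n_try=1, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])

# Toy blocks invented for illustration: (text, number of words inside links, label).
TRAIN = [
    ("Home News Sport Weather", 4, "boilerplate"),
    ("Privacy Terms Cookies", 3, "boilerplate"),
    ("Share this article on Facebook", 4, "boilerplate"),
    ("The city council voted on Tuesday to approve a new budget for public "
     "transport after months of debate", 0, "content"),
    ("Researchers say the new screen reader handles long articles better than "
     "earlier tools did in the same tests", 0, "content"),
    ("Heavy rain is expected across the region this weekend and drivers are "
     "being asked to plan extra time", 0, "content"),
]

if __name__ == "__main__":
    X = [shallow_features(text, linked) for text, linked, _ in TRAIN]
    y = [label for _, _, label in TRAIN]
    forest = train_forest(X, y)
    print(predict_forest(forest, shallow_features("Log in", 2)))
```

A single decision tree, as in the baseline method, corresponds to calling `build_tree` once on the full training set with all features considered at every split; the forest differs only in the resampling, the random feature choice, and the vote.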
