qRead: A fast and accurate article extraction method from web pages using partition features optimizations

We present qRead, a new method for real-time content extraction from web pages with high accuracy. Earlier approaches to content extraction rely on empirical filtering rules, Document Object Model (DOM) trees, or machine learning models. Although these methods have met with some success, they may not satisfy the combined requirements of real-time extraction and high accuracy. For example, constructing a DOM tree for a complex web page is time-consuming, and machine learning models can add unnecessary complexity. In contrast, qRead uses segment densities and similarities to identify the main content. Specifically, qRead first filters out obvious junk content, removes HTML tags, and partitions the remaining text into natural segments. It then uses the highest ratio of words to lines in a segment, combined with the similarity between the segment and the title, to identify the main content. Extensive experiments show that qRead achieves 96.8% accuracy on Chinese web pages with an average extraction time of 13.20 milliseconds, and 93.6% accuracy on English web pages with an average extraction time of 11.37 milliseconds, substantially improving accuracy over previous approaches while meeting the real-time extraction requirement.
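
The pipeline described above (filter junk, strip tags, partition into segments, score by density and title similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the blank-line segmentation rule, the word-overlap similarity, and the way density and similarity are multiplied together are all assumptions made for illustration, since the abstract does not specify them. The whitespace-style tokenization also only suits English text; Chinese pages would need a word segmenter.

```python
import re
from html import unescape

# Assumed notion of "obvious junk": script/style blocks and HTML comments.
JUNK = re.compile(r"<(script|style)\b.*?</\1\s*>|<!--.*?-->",
                  re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"\w+")


def words(text):
    """Lowercased word tokens (suitable for English only)."""
    return [w.lower() for w in WORD.findall(text)]


def segments(html):
    """Drop junk and tags, then split the text into runs of non-blank lines."""
    text = unescape(TAG.sub("", JUNK.sub("", html)))
    segs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            segs.append(current)
            current = []
    if current:
        segs.append(current)
    return segs


def density(seg):
    """Ratio of words to lines within one segment."""
    return sum(len(words(line)) for line in seg) / len(seg)


def title_similarity(seg, title):
    """Fraction of title words that also occur in the segment (assumed measure)."""
    title_words = set(words(title))
    if not title_words:
        return 0.0
    seg_words = set(words(" ".join(seg)))
    return len(title_words & seg_words) / len(title_words)


def extract_main_content(html, title):
    """Return the segment with the best combined score (assumed combination)."""
    segs = segments(html)
    if not segs:
        return ""
    best = max(segs, key=lambda s: density(s) * (1.0 + title_similarity(s, title)))
    return "\n".join(best)
```

Because the sketch uses only regular expressions and a single pass over the lines, it avoids building a DOM tree, which matches the motivation given in the abstract. A link-heavy navigation block scores low because each of its lines carries few words, while article paragraphs score high on both density and overlap with the title.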
