A novel algorithm for extracting the user reviews from web pages

Extracting user reviews from websites such as forums, blogs, news sites, e-commerce sites and travel sites is crucial for text processing applications (e.g. sentiment analysis, trend detection/monitoring and recommendation systems) that require structured data. Traditional algorithms consist of three steps: Document Object Model (DOM) tree creation, extraction of features from this tree, and machine learning. However, these steps increase the time complexity of the extraction process. This study proposes a novel algorithm that involves two complementary stages. The first stage determines which HTML tags correspond to the review layout for a web domain, using the DOM tree, its features and decision tree learning. The second stage extracts the review layout from the web pages of that domain using the tags found in the first stage. The second stage is more time-efficient, being approximately 21 times faster than the first stage. Moreover, it achieves a relatively high accuracy of 96.67% in our review block extraction experiments.
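As an illustration of the second stage only, the following minimal sketch shows how review blocks could be pulled from a page once a tag has been identified for a domain. It is not the authors' implementation: the function name `extract_reviews`, the class `ReviewBlockExtractor`, and the choice of a (tag name, class attribute) pair as the learned signature are all assumptions made for this example. It uses Python's standard `html.parser` event stream, so no DOM tree is built.

```python
from html.parser import HTMLParser


class ReviewBlockExtractor(HTMLParser):
    """Collect the text inside every element matching a (tag, class) pair.

    The (tag, class) pair stands in for the output of the first stage;
    that representation is an assumption of this sketch.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # nesting depth of self.tag inside a matched block
        self.buffer = []    # text fragments of the block being read
        self.blocks = []    # finished review blocks

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Already inside a review block: only track same-named nesting.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                # Block closed: join fragments and normalise whitespace.
                self.blocks.append(" ".join(" ".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_reviews(html, tag, css_class):
    """Return the text of each review block found in one pass over html."""
    parser = ReviewBlockExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    page = (
        '<html><body><div class="nav">Home</div>'
        '<div class="review item"><p>Great <b>phone</b></p><div>nested</div></div>'
        '<div class="review">Bad battery</div></body></html>'
    )
    # "div" and "review" are hypothetical first-stage results for this domain.
    for block in extract_reviews(page, "div", "review"):
        print(block)
```

Because the sketch processes the page as a single stream of parser events, it skips the tree construction, feature extraction and classification that the first stage needs, which is the kind of saving the abstract attributes to the second stage.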
