A novel approach for content extraction from web pages

The rapid development of the internet and of web publishing techniques has created numerous information sources published as HTML pages on the World Wide Web. However, web pages also carry a lot of redundant and irrelevant information. Navigation panels, tables of contents (TOC), advertisements, copyright statements, service catalogs, privacy policies, and similar elements are considered redundant or irrelevant content. Such information complicates web mining tasks such as web page crawling, web page classification, link-based ranking, and topic distillation. This paper discusses various approaches for extracting informative content from web pages and presents a new approach to content extraction that uses the word-to-leaf ratio and the density of links.
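The abstract names the two signals but does not define how they are combined, so the following is only an illustrative sketch under stated assumptions, not the method of the paper. It assumes that the word-to-leaf ratio of a DOM node is the number of words in its subtree divided by the number of leaf nodes in that subtree, that link density is the fraction of those words that sit inside `<a>` elements, and that the two are combined by a simple product. The names `Node`, `TreeBuilder`, `stats`, `score`, and `best_block` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_words = 0  # words in text directly under this element


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (text is stored as word counts)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP:
            self.current.own_words += len(data.split())


def stats(node, in_link=False):
    """Annotate each node with subtree word, link-word, and leaf counts."""
    in_link = in_link or node.tag == "a"
    words = node.own_words
    link_words = node.own_words if in_link else 0
    leaves = 0 if node.children else 1
    for child in node.children:
        w, lw, lv = stats(child, in_link)
        words, link_words, leaves = words + w, link_words + lw, leaves + lv
    node.words, node.link_words, node.leaves = words, link_words, leaves
    return words, link_words, leaves


def score(node):
    # Assumed combination: word-to-leaf ratio discounted by link density.
    if node.words == 0:
        return 0.0
    link_density = node.link_words / node.words
    return (node.words / node.leaves) * (1.0 - link_density)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def best_block(html):
    """Return the highest-scoring element, or None for an empty document."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    stats(builder.root)
    candidates = [n for n in iter_nodes(builder.root) if n is not builder.root]
    return max(candidates, key=score, default=None)


PAGE = """
<html><body>
<div id="nav"><a href="/">Home</a><a href="/a">About</a><a href="/c">Contact</a></div>
<div id="main"><p>one two three four five six seven eight nine ten</p></div>
<div id="footer"><a href="/p">Privacy policy</a> Copyright</div>
</body></html>
"""

best = best_block(PAGE)
print(best.tag, best.attrs.get("id"), best.words, best.link_words)
```

In this toy page the navigation block scores zero because all of its words are link text, and the footer scores low for the same reason, while the text-heavy block is selected. A practical extractor would need further rules, for example a minimum word count or a preference among tied ancestor and descendant nodes, none of which are specified in the abstract.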
