Extracting Various Types of Informative Web Content via Fuzzy Sequential Pattern Mining

In this paper, we present a web content extraction method to extract different types of informative web content for news web pages. A fuzzy sequential pattern mining method, namely FSP, is developed to gradually discover fuzzy sequential patterns for various types of informative web content. To avoid the situation that the usage of HTML tags may be changed with the development of web technology, fuzzy sequential patterns are mined using a stable feature, in particular, the number of tokens in each line of source code. We have conducted extensive experiments and good clustering properties for the discovered sequential patterns are observed. Experimental results demonstrate that the FSP method is effective compared with state-of-the-art content extraction methods. Besides main articles of web pages, it can also find other types interesting web content such as article recommendations and article titles effectively.

[1]  Wolfgang Nejdl,et al.  A densitometric approach to web page segmentation , 2008, CIKM '08.

[2]  Yi Liu,et al.  One-against-all multi-class SVM classification using reliability measures , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[3]  Jiawei Han,et al.  CETR: content extraction via tag ratios , 2010, WWW '10.

[4]  Lejian Liao,et al.  DOM based content extraction via text density , 2011, SIGIR.

[5]  Thomas Gottron,et al.  Content Code Blurring: A New Approach to Content Extraction , 2008, 2008 19th International Workshop on Database and Expert Systems Applications.