Block Classification of a Web Page by Using a Combination of Multiple Classifiers

Recently, researchers have been actively studying Web mining using the varied data available on the World Wide Web. Because Web pages are generally semi-structured, informative blocks are difficult to identify, so content detection techniques that remove unnecessary data (e.g., advertisements) from Web pages have become important. A Web page generally consists of many blocks containing various data and structural information. In this paper, we propose a method that classifies each block of a Web page into an appropriate category by building a Tree Alignment model that represents the HTML structure and a Vector model that represents the features of the blocks. Web sites normally have their own templates, and blocks may belong to different categories even when they occupy the same position in the Web browser or are structurally similar. Hence, it is difficult to classify blocks accurately with a single classifier. To solve this problem, our approach builds multiple classifiers, one for each training domain, and classifies blocks by combining them.
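
The idea of training one classifier per domain and combining their outputs can be illustrated with a minimal sketch. The abstract does not specify the combination rule, the feature set, or the base learner, so everything here is an assumption made for illustration: the two-dimensional feature vectors (read them as hypothetical link density and text density), the nearest-centroid base classifier, the category names, and the majority vote. The Tree Alignment model is not covered by this sketch.

```python
from collections import Counter
from math import dist


class CentroidClassifier:
    """Nearest-centroid classifier trained on blocks from a single domain.

    Stand-in base learner chosen for brevity; it is not the paper's model.
    """

    def __init__(self, labeled_vectors):
        # labeled_vectors: list of (feature_vector, category) pairs
        sums, counts = {}, Counter()
        for vec, label in labeled_vectors:
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] += 1
        # Mean feature vector per category
        self.centroids = {
            label: [s / counts[label] for s in acc]
            for label, acc in sums.items()
        }

    def predict(self, vec):
        # Category whose centroid is closest in Euclidean distance
        return min(self.centroids, key=lambda c: dist(vec, self.centroids[c]))


def classify_block(classifiers, vec):
    """Combine per-domain classifiers by majority vote (assumed rule)."""
    votes = Counter(clf.predict(vec) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Hypothetical training data: one list per training domain (Web site).
domain_a = [([0.9, 0.1], "navigation"), ([0.8, 0.2], "navigation"),
            ([0.1, 0.9], "content"), ([0.2, 0.8], "content")]
domain_b = [([0.7, 0.1], "advertisement"), ([0.1, 0.7], "content")]
domain_c = [([0.9, 0.2], "navigation"), ([0.2, 0.9], "content")]

classifiers = [CentroidClassifier(d) for d in (domain_a, domain_b, domain_c)]

text_heavy_block = classify_block(classifiers, [0.15, 0.85])
link_heavy_block = classify_block(classifiers, [0.85, 0.15])
```

In this toy data, the link-heavy block resembles an advertisement in one domain and navigation in the other two, which mirrors the abstract's point that structurally similar blocks can belong to different categories on different sites. The vote resolves the disagreement; a weighted or confidence-based combination would be an equally plausible choice.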
