Many people use the web as their main source of information in daily life. However, most web pages contain non-informational components, such as sidebars, footers, and ads, which complicate the extraction of text from the original HTML documents. Although web text extraction techniques have been developed, they require substantial human intervention and yield low-quality results, so their adoption and efficiency remain open problems. In this paper, we propose a maximum subsequence segmentation (MSS) algorithm and discuss its application to news websites. Unlike tree structure analysis and VIPS, the algorithm divides a web page into text segments and tag (label) segments. Experiments show that the MSS algorithm achieves 93.73% accuracy on 2000 news pages from 5 different news sites and runs much faster than a DOM-based approach on the same dataset.
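
The general idea behind maximum subsequence segmentation can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper. It assumes a simple regular-expression tokenizer and fixed, hand-picked token scores (`text_score` and `tag_score` are hypothetical values chosen for illustration), and it selects the contiguous token run with the highest total score using the standard maximum-subarray (Kadane) scan. All function names are illustrative.

```python
import re

# Assumption: a tag is anything between '<' and '>'; every other
# whitespace-separated run of characters counts as a text token.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and text (word) tokens."""
    return TOKEN_RE.findall(html)


def score(token, text_score=1.0, tag_score=-3.25):
    """Hypothetical fixed scores: reward text tokens, penalize tag tokens."""
    return tag_score if token.startswith("<") else text_score


def max_subsequence(scores):
    """Return (start, end), a half-open range with the maximum total score."""
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


def extract_text(html):
    """Keep only the text tokens inside the best-scoring token run."""
    tokens = tokenize(html)
    start, end = max_subsequence([score(t) for t in tokens])
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


if __name__ == "__main__":
    page = (
        "<html><body><div><a href='#'>Home</a></div>"
        "<p>Breaking news story text goes here today</p>"
        "<div><a>Ads</a></div></body></html>"
    )
    print(extract_text(page))
```

Because the scan is a single linear pass over the token sequence and builds no tree, a sketch like this avoids the cost of DOM construction. In practice the per-token scores would have to be tuned or learned rather than fixed by hand.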