News Web Text Extraction Based on the Maximum Subsequence Segmentation

Many people use the web as their main source of information in daily life. However, most web pages contain non-informational components, such as sidebars, footers, and ads, which complicate extracting text from the original HTML documents. Although web text extraction techniques have been developed, they require considerable human intervention and yield low-quality results, so their adoption and efficiency remain open problems. In this paper, we propose a maximum subsequence segmentation (MSS) algorithm and discuss its application to news web sites. Unlike tree-structure analysis and VIPS, the algorithm divides a web page into text segments and label segments. Experiments show that the MSS algorithm achieves 93.73% accuracy on 2000 news pages from 5 different news sites and runs much faster than a DOM-based approach on the same dataset.
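
To make the idea of maximum subsequence segmentation concrete, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes that "label" segments correspond to HTML tags, that every text token scores +1 and every tag token scores -1, and that the main text is the contiguous token run with the highest total score. The regular expression, the scoring values, and the function names (tokenize, score, max_subsequence, extract_text) are all hypothetical choices made for illustration.

```python
import re

# Hypothetical tokenizer: a token is either a whole HTML tag or a run of
# non-space characters that contains no "<".
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and text tokens."""
    return TOKEN_RE.findall(html)


def score(token, text_score=1.0, tag_score=-1.0):
    """Assumed scoring: reward text tokens, penalize tag tokens."""
    return tag_score if token.startswith("<") else text_score


def max_subsequence(scores):
    """Return (start, end) of the contiguous run with the largest sum.

    This is the classic linear-time maximum subarray scan. The end index
    is exclusive, and (0, 0) is returned when no run has a positive sum.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive prefix cannot help, so restart the run here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


def extract_text(html):
    """Keep the text tokens inside the best-scoring token run."""
    tokens = tokenize(html)
    start, end = max_subsequence([score(t) for t in tokens])
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a><a href="/s">Sports</a></div>'
        "<p>The council approved the new budget on Monday "
        "after a long debate.</p>"
        '<div><a href="/about">About</a></div>'
    )
    print(extract_text(page))
```

Because the scan makes a single pass over the token sequence and never builds a tree, this kind of approach can plausibly avoid the cost of DOM construction. The fixed scores used here are only placeholders for whatever scoring the full method defines.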