论文信息 - Extracting informative images from web news pages via imbalanced classification

Extracting informative images from web news pages via imbalanced classification

In this paper we propose an imbalanced classification algorithm to extract informative images from web news pages. Our algorithm resolve the difficult problem based on two approaches. First, we limit our dataset to a specific application area so that the patterns of the informative images can be captured by existing classification algorithms. Second, we propose an automatic negative samples filtering algorithm to eliminate most negative samples, so that the classification training data is rebalanced. Because most classification algorithms have reduced performance on imbalanced training data, our algorithm improves the overall performance significantly. In addition, our approach is inherently robust to new web sites and style/layout change of web sites.

Jianping Fan | Hangzai Luo | Wei Gong

[1] Ester Bernadó-Mansilla,et al. The class imbalance problem in learning classifier systems: a preliminary study , 2005, GECCO '05.

[2] Berthier A. Ribeiro-Neto,et al. A brief survey of web data extraction tools , 2002, SGMD.

[3] Bing Liu,et al. Structured Data Extraction from the Web Based on Partial Tree Alignment , 2006, IEEE Transactions on Knowledge and Data Engineering.