Extracting informative images from web news pages via imbalanced classification

In this paper, we propose an imbalanced classification algorithm for extracting informative images from web news pages. Our algorithm addresses this difficult problem with two techniques. First, we restrict the dataset to a specific application domain so that the patterns of informative images can be captured by existing classification algorithms. Second, we propose an automatic negative-sample filtering algorithm that eliminates most negative samples, so that the training data for classification is rebalanced. Because most classification algorithms perform worse on imbalanced training data, this rebalancing significantly improves overall performance. In addition, our approach is inherently robust to new web sites and to changes in the style or layout of existing ones.
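As a rough illustration of the rebalancing idea, the sketch below filters obvious negative samples out of a set of candidate images before any classifier is trained. It is a minimal sketch under stated assumptions, not the filtering algorithm proposed in the paper: the `ImageCandidate` type, the size and aspect-ratio rules, the threshold values, and the sample data are all hypothetical and chosen only to show how the class ratio shifts.

```python
from collections import Counter
from typing import NamedTuple


class ImageCandidate(NamedTuple):
    """An image found on a news page (hypothetical feature set)."""
    url: str
    width: int
    height: int
    informative: bool  # ground-truth label: True = belongs to the article


def filter_obvious_negatives(candidates, min_side=100, max_aspect=4.0):
    """Drop candidates that are very unlikely to be informative.

    The rules and thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    kept = []
    for c in candidates:
        short, long_ = sorted((c.width, c.height))
        if short < min_side:
            continue  # icons, spacers, tracking pixels
        if long_ / short > max_aspect:
            continue  # banner-shaped images
        kept.append(c)
    return kept


def class_counts(candidates):
    """Count positive (True) and negative (False) samples."""
    return Counter(c.informative for c in candidates)


# Made-up candidates from a single page: one positive, five negatives.
page = [
    ImageCandidate("story.jpg", 640, 420, True),
    ImageCandidate("logo.png", 120, 40, False),
    ImageCandidate("pixel.gif", 1, 1, False),
    ImageCandidate("banner.jpg", 728, 90, False),
    ImageCandidate("icon-share.png", 16, 16, False),
    ImageCandidate("related-thumb.jpg", 150, 110, False),
]

before = class_counts(page)
filtered = filter_obvious_negatives(page)
after = class_counts(filtered)

print("before filtering:", dict(before))
print("after filtering: ", dict(after))
```

The samples that survive the filter would then serve as training data for a standard classifier, which sees a far less skewed class distribution than the raw set of page images.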