Extracting informative images from web news pages via imbalanced classification
暂无分享,去创建一个
In this paper we propose an imbalanced classification algorithm to extract informative images from web news pages. Our algorithm resolve the difficult problem based on two approaches. First, we limit our dataset to a specific application area so that the patterns of the informative images can be captured by existing classification algorithms. Second, we propose an automatic negative samples filtering algorithm to eliminate most negative samples, so that the classification training data is rebalanced. Because most classification algorithms have reduced performance on imbalanced training data, our algorithm improves the overall performance significantly. In addition, our approach is inherently robust to new web sites and style/layout change of web sites.
[1] Ester Bernadó-Mansilla,et al. The class imbalance problem in learning classifier systems: a preliminary study , 2005, GECCO '05.
[2] Berthier A. Ribeiro-Neto,et al. A brief survey of web data extraction tools , 2002, SGMD.
[3] Bing Liu,et al. Structured Data Extraction from the Web Based on Partial Tree Alignment , 2006, IEEE Transactions on Knowledge and Data Engineering.