Enhancing Concept Detection by Pruning Data with MCA-Based Transaction Weights

With the rapid growth in the amount of multimedia data, research on semantic information retrieval faces a challenging problem: the number of positive data instances, which contain the target concept, object, or event, is much smaller than the number of negative data instances, which do not. This is known as the data imbalance issue. Consequently, data pruning has become a popular topic in multimedia information processing and retrieval. Data pruning automatically identifies and removes data instances from the training data set so that the pruned data set improves model learning, classification, and concept detection. In this paper, we propose a novel data pruning framework that assigns each transaction a weight based on multiple correspondence analysis (MCA). These transaction weights serve as the measure for pruning the training data set. The testing data set can be weighted and pruned as well, so the computational cost is reduced not only when building the model but also when applying the classifiers. In experiments with 18 high-level concepts and benchmark data sets (both balanced and imbalanced) from TRECVID, the proposed framework achieves promising results in enhancing the concept detection performance of three well-known classifiers commonly used for concept detection.
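As a rough illustration of weight-based pruning, the sketch below filters a labeled data set using per-transaction weights. It is not the framework's actual procedure. It assumes the weights have already been computed (the MCA step is not shown), and the names `prune_by_weight`, `positive_ratio`, and `threshold`, the rule of keeping instances whose weight is at or above the threshold, and the toy values are all hypothetical.

```python
from typing import List, Sequence, Tuple


def prune_by_weight(
    instances: Sequence[Sequence[float]],
    labels: Sequence[int],
    weights: Sequence[float],
    threshold: float,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep only the instances whose weight is >= threshold.

    Assumption: the weights are precomputed. How they are derived from
    MCA, and whether high or low weights should be kept, is not
    specified here; this rule is a placeholder.
    """
    if not (len(instances) == len(labels) == len(weights)):
        raise ValueError("instances, labels, and weights must have equal length")
    kept = [(x, y) for x, y, w in zip(instances, labels, weights) if w >= threshold]
    return [x for x, _ in kept], [y for _, y in kept]


def positive_ratio(labels: Sequence[int]) -> float:
    """Fraction of instances labeled positive (1); 0.0 for an empty set."""
    return sum(1 for y in labels if y == 1) / len(labels) if labels else 0.0


# Toy data: six instances, two of them positive (hypothetical values).
toy_instances = [[0.1], [0.4], [0.5], [0.9], [0.3], [0.8]]
toy_labels = [1, 0, 0, 0, 0, 1]
toy_weights = [0.9, 0.1, 0.2, 0.8, 0.05, 0.7]

pruned_instances, pruned_labels = prune_by_weight(
    toy_instances, toy_labels, toy_weights, threshold=0.5
)
print(positive_ratio(toy_labels), positive_ratio(pruned_labels))
```

In this toy case, pruning removes three negative instances, so the share of positives in the remaining set rises from one third to two thirds. The same filter could be applied to a testing set once its instances carry weights.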
