Effective Feature Space Reduction with Imbalanced Data for Semantic Concept Detection

Semantic understanding of multimedia content has become an active research topic in recent years. Semantic concept detection algorithms face many challenges, including the semantic gap and imbalanced data. In this paper, we propose a novel algorithm that uses multiple correspondence analysis (MCA) to discover the correlation between features and classes, in order to reduce the feature space and bridge the semantic gap. Moreover, the proposed algorithm explores the correlation between items (i.e., feature-value pairs generated for each feature) and classes, which extends its ability to handle imbalanced data sets. To evaluate the proposed algorithm, we compare its performance on semantic concept detection with that of several existing feature selection methods under various well-known classifiers, using a subset of the concepts and benchmark data available from the TRECVID project. The results demonstrate that the proposed algorithm achieves promising performance and performs significantly better than the compared feature selection methods on the imbalanced data sets.
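To make the notion of an item concrete, the following is a minimal sketch and not the MCA procedure itself: it discretizes one numeric feature into equal-width bins (an assumed discretization), treats each (feature, bin) pair as an item, and measures how often each item co-occurs with the positive class. The function names, the feature name `color`, the bin count, and the toy data are all hypothetical.

```python
from collections import Counter

def to_items(values, n_bins, name):
    """Map each numeric value to a feature-value pair (item) by equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    items = []
    for v in values:
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        items.append((name, b))
    return items

def item_class_rates(items, labels):
    """Fraction of positive instances among the instances carrying each item."""
    total = Counter(items)
    pos = Counter(it for it, y in zip(items, labels) if y == 1)
    return {it: pos[it] / total[it] for it in total}

# Toy imbalanced data: only 2 of 10 instances belong to the positive class.
values = [1, 2, 3, 2, 1, 3, 2, 5, 9, 10]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

items = to_items(values, n_bins=3, name="color")
rates = item_class_rates(items, labels)
```

In this toy case the item `("color", 2)` appears only with positive instances even though positives are rare overall, which is the kind of item-level association that a feature-level score averaged over all values can obscure.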
