Efficient sampling of training set in large and noisy multimedia data

As the amount of multimedia data grows day by day, thanks to cheaper storage devices and an increasing number of information sources, machine learning algorithms are faced with large and noisy datasets. The choice of training sample significantly influences the final results. However, a simple random sample (SRS) may not yield satisfactory results, because its blind approach to selecting samples may fail to adequately represent a large and noisy dataset. The difficulty is particularly apparent for huge datasets where, due to memory constraints, only very small sample sizes can be used. This is typically the case in multimedia applications, where data sizes are usually very large. In this article we propose a new and efficient method for sampling large and noisy multimedia data. The proposed method is based on a simple distance measure that compares the histogram of the sample set with that of the whole set in order to estimate the representativeness of the sample. The method also handles noise in an elegant manner, which SRS and other methods are unable to do. We experiment on image and audio datasets. Comparison with SRS and other methods shows that the proposed method is vastly superior in terms of sample representativeness, particularly for small sample sizes, while remaining comparable in running time to SRS, the least expensive method in terms of time.
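The core idea, comparing the histogram of a candidate sample against the histogram of the whole dataset to score representativeness, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the L1 distance between normalized histograms and the best-of-k candidate search are assumptions, since the abstract only states that a "simple distance measure" over histograms is used.

```python
import numpy as np

def histogram_distance(data, sample, bins=20):
    """L1 distance between the normalized histograms of the full
    dataset and a sample. Smaller values mean the sample's
    distribution better matches the whole set. (The L1 measure is
    an assumption; the paper specifies only a 'simple' distance.)"""
    h_full, edges = np.histogram(data, bins=bins)
    # Reuse the same bin edges so the two histograms are comparable.
    h_samp, _ = np.histogram(sample, bins=edges)
    p = h_full / h_full.sum()
    q = h_samp / max(h_samp.sum(), 1)
    return float(np.abs(p - q).sum())

def representative_sample(data, size, candidates=50, rng=None):
    """Draw several random samples and keep the one whose histogram
    is closest to the whole dataset's (a simple best-of-k search,
    assumed here for illustration)."""
    rng = np.random.default_rng(rng)
    best, best_d = None, np.inf
    for _ in range(candidates):
        s = rng.choice(data, size=size, replace=False)
        d = histogram_distance(data, s)
        if d < best_d:
            best, best_d = s, d
    return best, best_d
```

By construction, the returned sample is never worse than a single SRS draw under this distance, and noisy outliers, which occupy low-mass histogram bins, contribute little to the score, which hints at how histogram matching can tolerate noise.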
