Shaping datasets: Optimal data selection for specific target distributions across dimensions

This paper presents a method for dataset manipulation based on Mixed Integer Linear Programming (MILP). The proposed optimization reduces a dataset to a specified size while enforcing specific distributions across different dimensions. It exploits the redundancy of the initial dataset to generate more compact versions of it, each with a specified target distribution along every dimension. If the target distribution is uniform, the effect is balancing: all values along every dimension are equally represented. Other target distributions can also be specified, depending on the nature of the problem. The proposed approach may be used in machine learning, to shape training and testing datasets, or in crowdsourcing, to prepare datasets of a manageable size.
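To make the idea concrete, one way such a selection problem could be written as a MILP is sketched here. This is an illustrative formulation under stated assumptions, not necessarily the one used in the paper. All symbols are hypothetical: each dimension d is assumed to be discretized into B_d bins, a_{i,d,b} is 1 if item i falls in bin b of dimension d and 0 otherwise, M is the desired subset size, p_{d,b} is the target proportion for that bin, x_i is a binary variable indicating whether item i is selected, and s_{d,b} is a slack variable bounding the absolute deviation from the target count.

```latex
% Illustrative sketch only; notation is assumed, not taken from the paper.
\begin{align*}
\min_{x,\,s}\quad & \sum_{d=1}^{D}\sum_{b=1}^{B_d} s_{d,b} \\
\text{s.t.}\quad & \sum_{i=1}^{N} x_i = M, \\
& \sum_{i=1}^{N} a_{i,d,b}\,x_i - M\,p_{d,b} \le s_{d,b} && \forall d,\, b, \\
& M\,p_{d,b} - \sum_{i=1}^{N} a_{i,d,b}\,x_i \le s_{d,b} && \forall d,\, b, \\
& x_i \in \{0,1\}, \quad s_{d,b} \ge 0.
\end{align*}
```

Under this sketch, the balancing case corresponds to p_{d,b} = 1/B_d for every bin, and the two inequality constraints together linearize the absolute difference between the selected count and the target count M p_{d,b} in each bin.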
