The data representativeness criterion: Predicting the performance of supervised classification based on data set similarity

In a broad range of fields it may be desirable to reuse a supervised classification algorithm and apply it to a new data set. However, such an algorithm generalizes, and thus achieves a similar classification performance, only when the training data used to build it is similar to the new unseen data it is applied to. It is often unknown in advance how an algorithm will perform on new unseen data, which is a crucial reason for not deploying an algorithm at all. Therefore, tools are needed to measure the similarity of data sets. In this paper, we propose the Data Representativeness Criterion (DRC) to determine how representative a training data set is of a new unseen data set. We present a proof of principle to see whether the DRC can quantify the similarity of data sets and whether the DRC relates to the performance of a supervised classification algorithm. We compared a number of magnetic resonance imaging (MRI) data sets whose differences in acquisition parameters range from subtle to severe. Results indicate that, based on the similarity of data sets, the DRC is able to give an indication as to when the performance of a supervised classifier decreases. The strictness of the DRC can be set by the user, depending on what one considers to be an acceptable underperformance.
