Machine Learning Approach to Identifying the Dataset Threshold for the Performance Estimators in Supervised Learning

For small-scale machine learning projects, researchers have not yet established a threshold that inexperienced users, such as students, can use to categorise datasets when assessing and comparing the performance of machine learning algorithms. To address this gap, this paper presents a step-by-step guide for identifying the dataset threshold for performance estimators in supervised machine learning experiments. Identifying the threshold involves running experiments on four datasets of different sample sizes from the University of California Irvine (UCI) machine learning repository. The sample sizes are categorised according to the number of attributes and the number of instances in each dataset. The identified threshold will help inexperienced machine learning experimenters categorise datasets correctly and hence select an appropriate performance estimation method.
