Assessment of Classification Models with Small Amounts of Data

Classification, one of the central tasks of data mining, maps attributes (observations) to pre-specified classes. Classification models are built from underlying data, and in principle models built with more data yield better results. However, the relationship between the amount of available data and model performance is not well understood, beyond the fact that accuracy shows diminishing improvements as data size grows. In this paper, we present an approach for early assessment of the extracted knowledge (classification models) in terms of performance (accuracy) as a function of the amount of data used. The assessment is based on observing performance on smaller sample sizes. We define the solution formally and show its correctness and utility in experiments.
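
To illustrate the general idea of assessing performance from smaller samples, the sketch below fits a learning curve to error rates measured at a few small sample sizes and extrapolates to a larger one. It assumes a power-law error model, error(n) = a * n^(-b), which is one common choice for learning curves and is not necessarily the model used in this paper. The function names and the sample measurements are hypothetical and serve only as an illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by ordinary least squares in log-log space.

    Assumption: the power-law form is chosen for illustration only.
    Returns the pair (a, b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = log(a) - b * log(n), so a = exp(intercept) and b = -slope
    return math.exp(intercept), -slope


def predict_accuracy(a, b, n):
    """Extrapolated accuracy at sample size n under the fitted power law."""
    return 1.0 - a * n ** (-b)


if __name__ == "__main__":
    # Hypothetical error rates observed on small training samples
    sizes = [100, 200, 400, 800]
    errors = [0.30, 0.24, 0.19, 0.15]
    a, b = fit_power_law(sizes, errors)
    for n in (1600, 3200, 10000):
        print(n, round(predict_accuracy(a, b, n), 3))
```

With a positive exponent b, the fitted curve reproduces the diminishing improvements mentioned above: each doubling of the sample size removes a smaller absolute amount of error than the previous one.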
