Mining extremely small data sets with application to software reuse

A serious problem encountered by machine learning and data mining techniques in software engineering is the lack of sufficient data. For example, the current largest data set on software reuse contains only 24 examples. In this paper, a recently proposed machine learning algorithm is adapted for mining extremely small data sets. The algorithm works in a twice-learning style: first, a random forest is trained on the original data set; then, virtual examples are generated from the random forest and used to train a single decision tree. In contrast to the numerous discrepancies between empirical data and expert opinions reported by previous research, our mining practice shows that the empirical data are actually consistent with expert opinions. Copyright © 2008 John Wiley & Sons, Ltd.
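The twice-learning procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the synthetic 24-example data set, the uniform sampling of virtual examples within the observed feature ranges, and all parameter choices are assumptions made for the sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# A tiny data set, comparable in size to the 24-example reuse data
# (synthetic stand-in; the real data set is not reproduced here).
X, y = make_classification(n_samples=24, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# First learning step: train a random forest on the original small data set.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Generate virtual examples: sample feature vectors within the observed
# ranges and label them with the forest's predictions (the sampling scheme
# here is an assumption; the paper's generation scheme may differ).
lo, hi = X.min(axis=0), X.max(axis=0)
X_virtual = rng.uniform(lo, hi, size=(1000, X.shape[1]))
y_virtual = forest.predict(X_virtual)

# Second learning step: train a single, comprehensible decision tree
# on the virtual examples.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_virtual, y_virtual)

# The tree approximates the forest while remaining a single readable model.
agreement = (tree.predict(X) == forest.predict(X)).mean()
```

The resulting tree trades a little of the ensemble's accuracy for comprehensibility, which is the point of the twice-learning style: the forest smooths the tiny data set, and the tree distills it into inspectable rules.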
