Paper information - Classification of source code archives

Classification of source code archives

The World Wide Web contains a number of source code archives. Programs are usually classified into various categories within the archive by hand. We report on experiments for automatic classification of source code into these categories. We examined a number of factors that affect classification accuracy. Weighting features by expected entropy loss makes a significant improvement in classification accuracy. We show a Support Vector Machine can be trained to classify source code with a high degree of accuracy. We feel these results show promise for software reuse.
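The abstract names expected entropy loss as the feature weighting but does not give a formula. As a minimal sketch, assuming the standard information-theoretic definition (prior class entropy minus the expected class entropy after observing whether a feature is present), the score for one binary feature and one category could be computed as below. The function names, the count-based interface, and the token counts in the usage lines are all hypothetical and are not taken from the paper.

```python
import math


def entropy(p):
    """Entropy in bits of a two-outcome variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Score a binary feature for one category (assumed definition).

    pos_with / pos_total: in-category documents containing the feature / all
    in-category documents; neg_with / neg_total: the same for the rest.
    """
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f

    # Class entropy before looking at the feature.
    prior = entropy(pos_total / total)

    # Class entropy within the documents that do / do not contain the feature.
    h_with = entropy(pos_with / with_f) if with_f else 0.0
    h_without = entropy((pos_total - pos_with) / without_f) if without_f else 0.0

    # Expected entropy after observing the feature, weighted by how often
    # each outcome (present / absent) occurs.
    posterior = (with_f / total) * h_with + (without_f / total) * h_without
    return prior - posterior


# Hypothetical counts: (in-category docs with token, other docs with token),
# out of 50 in-category and 50 other documents.
counts = {"malloc": (40, 2), "printf": (30, 28), "socket": (1, 35)}
ranked = sorted(
    counts,
    key=lambda tok: expected_entropy_loss(counts[tok][0], 50, counts[tok][1], 50),
    reverse=True,
)
```

Under this definition, a feature that appears in every in-category document and in no others removes all uncertainty about the class, while a feature spread evenly across both groups scores zero. The top-ranked features could then serve as the inputs to a classifier such as a Support Vector Machine.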

C. Lee Giles | Robert Krovetz | Secil Ugurel

[1] Charles W. Krueger, et al. Software reuse, 1992, CSUR.

[2] J. G. Daugman, et al. Information Theory and Coding, 1998.

[3] Mary Beth Rosson, et al. The reuse of uses in Smalltalk programming, 1996, TCHI.

[4] D. Merkl, et al. Content-based software classification by self-organization, 1995, Proceedings of ICNN'95 - International Conference on Neural Networks.

[5] Thorsten Joachims, et al. Text categorization with support vector machines, 1999.

[6] Amir Michail, et al. Code Search based on CVS Comments: A Preliminary Evaluation, 2001.

[7] C. Lee Giles, et al. What's the code?: automatic classification of source code archives, 2002, KDD.

[8] Neil C. Rowe, et al. Applying information-retrieval methods to software reuse: a case study, 2003, Inf. Process. Manag.

[9] Kristin P. Bennett, et al. Support vector machines: hype or hallelujah?, 2000, SKDD.

[10] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.