What's the code?: automatic classification of source code archives

The World Wide Web hosts many source code archives, usually organized by application category and programming language. Organizing these repositories by hand is not a trivial task, since they are very large (on the order of terabytes) and grow rapidly. We demonstrate machine learning methods for automatically classifying archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and the Sourceforge archives. Support vector machine (SVM) classifiers are trained on examples of programs written in a given programming language or belonging to a specified category. We show that source code can be classified accurately and automatically into topical categories, and that its programming language can be identified.
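
To make the language-identification setup concrete, the following is a minimal sketch, not the authors' implementation. It assumes bag-of-tokens features, a linear SVM trained by subgradient descent on the regularized hinge loss, and a one-vs-rest scheme for the multiclass decision; the abstract does not specify the features, kernel, or multiclass strategy. All names (`tokenize`, `featurize`, `train_linear_svm`, `TOY_DOCS`, and so on), the hyperparameters, and the six toy snippets are hypothetical.

```python
import math
import re
from collections import Counter

# Identifiers/keywords, plus each punctuation character as its own token.
# Numeric literals are dropped by this toy tokenizer (an assumption).
TOKEN = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")

def tokenize(source):
    return TOKEN.findall(source)

def featurize(source):
    """L2-normalized bag-of-tokens vector, stored sparsely as a dict."""
    counts = Counter(tokenize(source))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}

def score(model, x):
    """Signed decision value w.x + b for one sparse feature vector."""
    w, b = model
    return sum(w.get(tok, 0.0) * v for tok, v in x.items()) + b

def train_linear_svm(xs, ys, epochs=300, lr=0.05, lam=1e-3):
    """Binary linear SVM via subgradient descent on the L2-regularized
    hinge loss. Labels in ys must be +1 or -1."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            violated = y * score((w, b), x) < 1.0
            for tok in w:                 # gradient step for the L2 penalty
                w[tok] *= 1.0 - lr * lam
            if violated:                  # subgradient step for the hinge term
                for tok, v in x.items():
                    w[tok] = w.get(tok, 0.0) + lr * y * v
                b += lr * y
    return w, b

def train_one_vs_rest(docs, langs):
    """One binary SVM per language: that language against all others."""
    xs = [featurize(d) for d in docs]
    return {
        lang: train_linear_svm(xs, [1 if l == lang else -1 for l in langs])
        for lang in sorted(set(langs))
    }

def predict(models, source):
    """Return the language whose classifier gives the highest score."""
    x = featurize(source)
    return max(models, key=lambda lang: score(models[lang], x))

# Hypothetical toy training set; a real archive would supply many files per class.
TOY_DOCS = [
    '#include <stdio.h>\nint main(void) { printf("hi"); return 0; }',
    '#include <stdlib.h>\nvoid *p = malloc(sizeof(int)); free(p);',
    'def main():\n    print("hi")\n\nif __name__ == "__main__":\n    main()',
    'import os\nfor name in os.listdir("."):\n    print(name)',
    'public class Main { public static void main(String[] args) { System.out.println("hi"); } }',
    'import java.util.List;\npublic class Box { private List<String> items; }',
]
TOY_LANGS = ["C", "C", "Python", "Python", "Java", "Java"]

if __name__ == "__main__":
    models = train_one_vs_rest(TOY_DOCS, TOY_LANGS)
    print(predict(models, "public class Demo { private int size; }"))
```

Topical classification could reuse the same pipeline with application categories as labels in place of languages. With only two examples per class, this sketch illustrates the mechanics and says nothing about the accuracy reported in the paper.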
