Toward mining "concept keywords" from identifiers in large software projects

We propose the Concept Keyword Term Frequency/Inverse Document Frequency (ckTF/IDF) method, a novel technique for efficiently mining concept keywords from identifiers in large software projects. ckTF/IDF is suited to this task because it is more lightweight than the standard TF/IDF method and its heuristics are tuned for identifiers in programs. We applied ckTF/IDF experimentally to our educational operating system udos, which consists of around 5,000 lines of C code, and obtained promising results: the udos source code was processed in 1.4 seconds with an accuracy of around 57%. This preliminary result suggests that our approach is useful for mining concept keywords from identifiers, although further research and experience are needed.
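To make the baseline concrete, the following is a minimal sketch of ordinary TF/IDF applied to words split out of identifiers. It is not the ckTF/IDF method itself, whose heuristics are not described here. The function names (`split_identifier`, `tfidf_keywords`), the snake_case/camelCase splitting rule, and the example file and identifier names are all assumptions made for illustration and are not taken from udos.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split an identifier into lowercase words (snake_case and camelCase).

    Assumption: this splitting rule is illustrative only; purely numeric
    fragments are dropped.
    """
    words = []
    for part in name.split("_"):
        # Acronym runs (e.g. "HTTP"), capitalized or lowercase words, digits.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words if not w.isdigit()]


def tfidf_keywords(files, top_n=3):
    """Rank identifier words per file with classic TF/IDF.

    files: dict mapping a file name to the list of identifiers it contains.
    Returns a dict mapping each file name to its top_n highest-scoring words.
    """
    tokens = {
        fname: [w for ident in idents for w in split_identifier(ident)]
        for fname, idents in files.items()
    }

    # Document frequency: in how many files does each word occur?
    df = Counter()
    for words in tokens.values():
        df.update(set(words))

    n_files = len(tokens)
    result = {}
    for fname, words in tokens.items():
        tf = Counter(words)
        total = len(words)
        scores = {
            w: (count / total) * math.log(n_files / df[w])
            for w, count in tf.items()
        }
        # Highest score first; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[fname] = ranked[:top_n]
    return result


# Hypothetical input: file names and identifiers invented for this sketch.
example_files = {
    "fs.c": ["read_block", "write_block", "blockCache", "fs_init"],
    "proc.c": ["proc_table", "proc_create", "procExit", "proc_init"],
    "mem.c": ["page_alloc", "page_free", "mem_init"],
}

if __name__ == "__main__":
    for fname, keywords in tfidf_keywords(example_files).items():
        print(fname, keywords)
```

In this sketch a word such as `init`, which occurs in every file, receives an IDF of zero and drops out of the ranking, while words concentrated in one file rise to the top. That is the general behaviour a TF/IDF-style weighting is expected to show on identifiers.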
