Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications

A fundamental problem of finding software applications that are highly relevant to development tasks is the mismatch between the high-level intent reflected in the descriptions of these tasks and low-level implementation details of applications. To reduce this mismatch we created an approach called EXEcutable exaMPLes ARchive (Exemplar) for finding highly relevant software projects from large archives of applications. After a programmer enters a natural-language query that contains high-level concepts (e.g., MIME, datasets), Exemplar retrieves applications that implement these concepts. Exemplar ranks applications in three ways. First, we consider the descriptions of applications. Second, we examine the Application Programming Interface (API) calls used by applications. Third, we analyze the dataflow among those API calls. We performed two case studies (with professional and student developers) to evaluate how these three rankings contribute to the quality of the search results from Exemplar. The results of our studies show that the combined ranking of application descriptions and API documents yields the most-relevant search results. We released Exemplar and our case study data to the public.

[1]  Susan T. Dumais,et al.  The vocabulary problem in human-system communication , 1987, CACM.

[2]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[3]  Charles W. Krueger,et al.  Software reuse , 1992, CSUR.

[4]  Ted J. Biggerstaff,et al.  Program understanding and the concept assignment problem , 1994, CACM.

[5]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[6]  David Notkin,et al.  Software reflexion models: bridging the gap between source and high-level models , 1995, SIGSOFT FSE.

[7]  Scott Henninger,et al.  Supporting the construction and evolution of component repositories , 1996, Proceedings of IEEE 18th International Conference on Software Engineering.

[8]  Steven S. Muchnick,et al.  Advanced Compiler Design and Implementation , 1997 .

[9]  Nicolas Anquetil,et al.  Assessing the relevance of identifier names in a legacy software system , 1998, CASCON.

[10]  David Notkin,et al.  Software Reflexion Models: Bridging the Gap between Design and Implementation , 2001, IEEE Trans. Software Eng..

[11]  Gerhard Fischer,et al.  Supporting reuse by delivering task-relevant and personalized information , 2002, ICSE '02.

[12]  Shinji Kusumoto,et al.  Component rank: relative significance rank for software component search , 2003, 25th International Conference on Software Engineering, 2003. Proceedings..

[13]  William R. Hersh,et al.  Managing Gigabytes—Compressing and Indexing Documents and Images (Second Edition) , 2001, Information Retrieval.

[14]  Rosco Hill,et al.  Automatic method completion , 2004, Proceedings. 19th International Conference on Automated Software Engineering, 2004..

[15]  Kevin Crowston,et al.  The Perils and Pitfalls of Mining SourceForge , 2004, MSR.

[16]  Thorsten Joachims,et al.  Eye-tracking analysis of user behavior in WWW search , 2004, SIGIR '04.

[17]  Janice Singer,et al.  Hipikat: a project memory for software development , 2005, IEEE Transactions on Software Engineering.

[18]  Rastislav Bodík,et al.  Jungloid mining: helping to navigate the API jungle , 2005, PLDI '05.

[19]  Kajal T. Claypool,et al.  Finding a needle in the haystack: a technique for ranking matches between components , 2005, CBSE'05.

[20]  Shinji Kusumoto,et al.  Ranking significance of software components based on use relations , 2003, IEEE Transactions on Software Engineering.

[21]  Gail C. Murphy,et al.  Using structural context to recommend source code examples , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[22]  Robert J. Walker,et al.  Strathcona example recommendation tool , 2005, ESEC/FSE-13.

[23]  Martin P. Robillard,et al.  Automatic generation of suggestions for program investigation , 2005, ESEC/FSE-13.

[24]  Gerhard Fischer,et al.  Reuse-Conducive Development Environments , 2005, Automated Software Engineering.

[25]  Robert J. Walker,et al.  Approximate Structural Context Matching: An Approach to Recommend Relevant Examples , 2006, IEEE Transactions on Software Engineering.

[26]  Brad A. Myers,et al.  Mica: A Web-Search Tool for Finding API Components and Examples , 2006, Visual Languages and Human-Centric Computing (VL/HCC'06).

[27]  Kajal T. Claypool,et al.  XSnippet: mining For sample code , 2006, OOPSLA '06.

[28]  James Allan,et al.  A comparison of statistical significance tests for information retrieval evaluation , 2007, CIKM '07.

[29]  Premkumar T. Devanbu,et al.  Recommending random walks , 2007, ESEC-FSE '07.

[30]  Yann-Gaël Guéhéneuc,et al.  Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval , 2007, IEEE Transactions on Software Engineering.

[31]  Tao Xie,et al.  Parseweb: a programmer assistant for reusing open source code on the web , 2007, ASE.

[32]  Mark Grechanik,et al.  Finding Relevant Applications for Prototyping , 2007, Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007).

[33]  Mark Sanderson,et al.  The relationship between IR effectiveness measures and user satisfaction , 2007, SIGIR.

[34]  Martin P. Robillard,et al.  Topology analysis of software dependencies , 2008, TSEM.

[35]  Rob Miller,et al.  Keyword programming in Java , 2008, Automated Software Engineering.

[36]  Tao Xie,et al.  SpotWeb: Detecting Framework Hotspots and Coldspots via Mining Open Source Code on the Web , 2008, 2008 23rd IEEE/ACM International Conference on Automated Software Engineering.

[37]  Sushil Krishna Bajracharya,et al.  Sourcerer: mining and searching internet-scale software repositories , 2008, Data Mining and Knowledge Discovery.

[38]  Sushil Krishna Bajracharya,et al.  Mining search topics from a code search engine usage log , 2009, 2009 6th IEEE International Working Conference on Mining Software Repositories.

[39]  Koushik Sen,et al.  SNIFF: A Search Engine for Java Using Free-Form Queries , 2009, FASE.

[40]  Steven P. Reiss,et al.  Semantics-based code search , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[41]  Heng Tao Shen,et al.  Principal Component Analysis , 2009, Encyclopedia of Biometrics.

[42]  Denys Poshyvanyk,et al.  Creating and evolving software by searching, selecting and synthesizing relevant source code , 2009, 2009 31st International Conference on Software Engineering - Companion Volume.

[43]  James D. Herbsleb,et al.  Improving API documentation usability with knowledge pushing , 2009, 2009 IEEE 31st International Conference on Software Engineering.

[44]  Sushil Krishna Bajracharya,et al.  Applying test-driven code search to the reuse of auxiliary functionality , 2009, SAC '09.

[45]  Katsuro Inoue,et al.  Assessing the impact of framework changes using component ranking , 2009, 2009 IEEE International Conference on Software Maintenance.

[46]  Collin McMillan,et al.  A search engine for finding highly relevant applications , 2010, 2010 ACM/IEEE 32nd International Conference on Software Engineering.

[47]  Joel Ossher,et al.  Searching API usage examples in code repositories with sourcerer API search , 2010, SUITE '10.

[48]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[49]  Scott R. Klemmer,et al.  Example-centric programming: integrating web search into the development environment , 2010, CHI.

[50]  Sushil Krishna Bajracharya,et al.  Leveraging usage similarity for effective retrieval of examples in code repositories , 2010, FSE '10.

[51]  Zhendong Su,et al.  A study of the uniqueness of source code , 2010, FSE '10.

[52]  Cristina V. Lopes,et al.  How Well Do Search Engines Support Code Retrieval on the Web? , 2011, TSEM.