Minersoft: Software retrieval in grid and cloud computing infrastructures

One of the main goals of Cloud and Grid infrastructures is to make their services easily accessible and attractive to end-users. In this article we investigate the problem of supporting keyword-based search for software files installed on the nodes of large-scale, federated Grid and Cloud computing infrastructures. We address a number of challenges that arise from the unstructured nature of software and the unavailability of software-related metadata in large-scale networked environments. We present Minersoft, a harvester that visits Grid/Cloud infrastructures, crawls their file systems, identifies and classifies software files, and discovers implicit associations between them. The results of Minersoft harvesting are encoded in a weighted, typed graph called the Software Graph. A number of information retrieval (IR) algorithms are used to enrich this graph with structural and content associations, to annotate software files with keywords, and to build inverted indexes that support keyword-based searching for software. We present an evaluation study of our approach on a real testbed, using data extracted from production-quality Grid and Cloud computing infrastructures. Experimental results show that Minersoft is a powerful tool for software search and discovery.
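The pipeline outlined in the abstract (a typed, weighted graph over software files, keyword annotation through associations, and an inverted index for keyword search) can be illustrated with a minimal Python sketch. This is not Minersoft's implementation: the file paths, the type labels, the rule that files under the same installation directory are structurally associated, the edge weight of 0.5, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: file path -> (assumed type label, text content).
FILES = {
    "/opt/blast/bin/blastn": ("binary", "nucleotide sequence alignment search"),
    "/opt/blast/doc/README": ("documentation", "blast sequence alignment manual"),
    "/opt/fftw/lib/libfftw3.so": ("library", "fast fourier transform"),
}


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real IR preprocessing."""
    return text.lower().split()


def package_root(path):
    """Assumed heuristic: the first two path components identify a package."""
    return "/".join(path.split("/")[:3])


def build_software_graph(files):
    """Connect files in the same package with a typed, weighted edge."""
    graph = defaultdict(list)
    for a, b in combinations(sorted(files), 2):
        if package_root(a) == package_root(b):
            graph[a].append((b, "structural", 0.5))  # weight is an assumption
            graph[b].append((a, "structural", 0.5))
    return graph


def build_inverted_index(files, graph):
    """Map each term to the files it annotates, with a per-file score.

    A file scores 1.0 for its own terms and inherits its neighbours'
    terms scaled by the weight of the connecting edge.
    """
    index = defaultdict(lambda: defaultdict(float))
    for path, (_file_type, text) in files.items():
        for term in set(tokenize(text)):
            index[term][path] += 1.0
        for neighbour, _edge_type, weight in graph.get(path, []):
            for term in set(tokenize(files[neighbour][1])):
                index[term][path] += weight
    return index


def search(index, query):
    """Return (path, score) pairs, best score first, ties broken by path."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for path, score in index.get(term, {}).items():
            scores[path] += score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = build_software_graph(FILES)
index = build_inverted_index(FILES, graph)
print(search(index, "manual"))
```

In this toy setup the binary carries no occurrence of "manual" itself, yet it is still returned for that query because it inherits the term from the documentation file it is associated with. That is the kind of effect the graph enrichment is meant to achieve for software files that lack descriptive metadata.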
