Paper information - Minersoft: Software retrieval in grid and cloud computing infrastructures

One of the main goals of Cloud and Grid infrastructures is to make their services easily accessible and attractive to end-users. In this article we investigate the problem of supporting keyword-based search for software files installed on the nodes of large-scale, federated Grid and Cloud computing infrastructures. We address a number of challenges that arise from the unstructured nature of software and the unavailability of software-related metadata in large-scale networked environments. We present Minersoft, a harvester that visits Grid/Cloud infrastructures, crawls their file systems, identifies and classifies software files, and discovers implicit associations between them. The results of Minersoft harvesting are encoded in a weighted, typed graph, called the Software Graph. A number of information retrieval (IR) algorithms are used to enrich this graph with structural and content associations, to annotate software files with keywords, and to build inverted indexes that support keyword-based searching for software. We present an evaluation study of our approach on a real testbed, using data extracted from production-quality Grid and Cloud computing infrastructures. Experimental results show that Minersoft is a powerful tool for software search and discovery.
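
The pipeline described in the abstract (a typed, weighted graph of files, keyword annotation, and an inverted index for keyword search) can be illustrated with a minimal sketch. This is not Minersoft's implementation: the class `SoftwareGraph`, its methods, the `describes` edge type, the one-hop keyword propagation rule, and the file paths are all hypothetical names and choices made here for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class SoftwareGraph:
    """Toy stand-in for a weighted, typed graph of software files (hypothetical)."""

    def __init__(self):
        self.keywords = defaultdict(set)  # file path -> set of keywords
        self.edges = defaultdict(list)    # file path -> [(neighbor, edge_type, weight)]

    def add_file(self, path, text):
        """Annotate a file node with keywords extracted from some text."""
        self.keywords[path].update(tokenize(text))

    def add_edge(self, src, dst, edge_type, weight):
        """Record a typed, weighted association between two files."""
        self.edges[src].append((dst, edge_type, weight))

    def enrich(self):
        """Assumed rule: copy each node's keywords to its direct neighbors (one hop).

        The edge type and weight are stored but not used by this simplified rule.
        """
        snapshot = {node: set(kws) for node, kws in self.keywords.items()}
        for src, links in self.edges.items():
            for dst, _edge_type, _weight in links:
                self.keywords[dst] |= snapshot.get(src, set())

    def build_index(self):
        """Build an inverted index: keyword -> set of file paths."""
        index = defaultdict(set)
        for node, kws in self.keywords.items():
            for kw in kws:
                index[kw].add(node)
        return index


def search(index, query):
    """Return the files whose keywords contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical example: a README that describes a binary with no text of its own.
graph = SoftwareGraph()
graph.add_file("/opt/blast/bin/blastn", "blastn")
graph.add_file("/opt/blast/doc/README", "BLAST sequence alignment tool")
graph.add_edge("/opt/blast/doc/README", "/opt/blast/bin/blastn", "describes", 1.0)
graph.enrich()
hits = search(graph.build_index(), "sequence alignment")
```

In this sketch the binary becomes findable by the query only because keywords flow to it from the associated README, which is the general idea behind using file associations to compensate for missing software metadata.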
