All-Pairs: An Abstraction for Data-Intensive Computing on Campus Grids

Today, campus grids provide users with easy access to thousands of CPUs. However, it is not always easy for nonexpert users to harness these systems effectively. A large workload composed in what seems to be the obvious way by a naive user may accidentally abuse shared resources and achieve very poor performance. To address this problem, we argue that campus grids should provide end users with high-level abstractions that allow for the easy expression and efficient execution of data-intensive workloads. We present one example of such an abstraction, All-Pairs, which fits the needs of several applications in biometrics, bioinformatics, and data mining. We demonstrate that an optimized All-Pairs abstraction is easier to use than the underlying system, achieves performance orders of magnitude better than the obvious but naive approach, and is both faster and more efficient than a tuned conventional approach. This abstraction has been in production use for one year on a 500-CPU campus grid at the University of Notre Dame and has been used to carry out a groundbreaking analysis of biometric data.
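To make the semantics concrete, the sketch below shows the pattern the abstraction expresses: given two sets A and B and a comparison function F, compute F on every pair drawn from A and B, producing a result matrix. This is a minimal serial sketch of the semantics only, not the distributed implementation the paper describes; the names all_pairs and compare are illustrative assumptions, and the toy string comparator merely stands in for a real function such as an iris-code matcher or sequence aligner.

```python
# Minimal sketch of the All-Pairs semantics: M[i][j] = F(A[i], B[j]).
# Illustrative only; the real system distributes this computation
# across a campus grid rather than running one serial loop.

from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def all_pairs(a: Sequence[T], b: Sequence[T],
              f: Callable[[T, T], float]) -> List[List[float]]:
    """Return the |A| x |B| matrix of F applied to every pair."""
    return [[f(x, y) for y in b] for x in a]

def compare(x: str, y: str) -> float:
    """Toy similarity: fraction of positions where the strings agree."""
    return sum(c1 == c2 for c1, c2 in zip(x, y)) / max(len(x), len(y))

matrix = all_pairs(["GATTACA", "GATTTCA"],
                   ["GATTACA", "CATTACA"],
                   compare)
print(matrix)  # [[1.0, ~0.857], [~0.857, ~0.714]]
```

The naive approach runs this doubly nested loop as thousands of independent grid jobs, each refetching its inputs; the optimized abstraction instead stages the data to the workers once and partitions the matrix into sub-blocks per CPU, which is where the orders-of-magnitude performance difference cited above comes from.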
