The Modern Research Data Portal: a design pattern for networked, data-intensive science

Author(s): Chard, K; Dart, E; Foster, I; Shifflett, D; Tuecke, S; Williams, J

Abstract: © 2018 Chard et al. We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.
