The Modern Research Data Portal: a design pattern for networked, data-intensive science

We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.

Subjects: Computer Networks and Communications, Data Science, Distributed and Parallel Computing, Security and Privacy, World Wide Web and Web Science
