Managing data within the HUBzero™ platform.

During the past 6 years, I have had the opportunity to work with a wide range of projects. Our HUBzero software platform currently powers nanoHUB.org and 25 other Websites or “hubs” used for both research and education (McLennan and Kennell, 2010). These hubs support many different areas of science and engineering, including nanotechnology, microelectromechanical systems, cancer care, bio-fuels, earthquake engineering, volcanic activity, pharmaceutical engineering, environmental modeling, battery technology for electric vehicles, and heat transfer applications. Altogether, these Websites have served more than 390,000 unique visitors during the past 12 months.

Through our experiences with these projects, we have found at least two different approaches for managing data. In the first, researchers use unstructured data, such as project notes held in a wiki, collections of PowerPoint files and reports, program code, and other text files, to represent data for their projects. Unstructured data are usually stored on a local desktop or may be managed by facilities such as Confluence, SharePoint, Google Groups, and Dropbox. HUBzero supports aspects of all of these facilities. Within a hub, users can create private groups and invite other users to join. In this mode, a professor may work together with graduate students, undergraduate students may work together on a class project, or colleagues may work together to develop a research proposal. Users leverage a collaborative wiki for project notes, a discussion forum, and a Subversion repository for editing/versioning of source code, experimental results, and other project data. The data kept within these projects are available only to members allowed to join the group. Data sets, summaries, and other products shaped within this private space may be published as desired and made available to the general public under Creative Commons and other license terms.
Researchers also share structured data, such as spreadsheets, protein databank files, satellite images, and telescope data. A few hubs have begun to develop relational databases coupled with tools for data mining within HUBzero. For example, the Cancer Care Engineering project at cceHUB.org has created a database for tracking blood samples, clinical patient data, and related proteomics and metabolomics datasets. Researchers at the IU School of Medicine collect the blood samples and upload patient data using forms such as the one shown in Figure 1. Other researchers mine the data by applying statistical analysis tools within the hub environment. This facility was created in such a way that it can be quickly customized and transferred to other projects. Recently, the same code was used to create a thermal properties database on thermalHUB.org. Similar code is being developed to support earthquake engineering data for the nees.org project.

Perhaps the most compelling feature of HUBzero is the way it helps researchers create and publish interactive simulation tools, along with seminars, tutorials, teaching materials, and other supporting resources, all in a way that can be accessed via an ordinary Web browser. New tools can be constructed in a matter of hours using HUBzero’s Rappture toolkit, as shown in Figure 2. Developers start by building a description of the inputs/outputs for their tool. In the past, this has been done by coding the description directly in an XML document, but the new “Instant Rappture” builder makes this process even easier for domain scientists. Researchers drag items from a palette of objects into their tool description, specify labels, units of measure, and other semantic information for each object, then save the resulting description in a file named tool.xml. When a user launches the tool, Rappture reads the XML description and renders a graphical interface with controls for the various input elements.
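To make the step from tool description to rendered controls concrete, the following is a minimal Python sketch that uses only the standard library. It does not call Rappture itself. The XML layout is a simplified assumption modeled on the kind of description outlined above, not a complete or authoritative schema, and the names TOOL_XML, describe_inputs, and render_form are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified tool description. The element names are
# illustrative assumptions and not a full tool.xml schema.
TOOL_XML = """\
<run>
  <tool>
    <title>Fermi-Dirac demo</title>
  </tool>
  <input>
    <number id="temperature">
      <about><label>Temperature</label></about>
      <units>K</units>
      <default>300</default>
    </number>
    <number id="energy">
      <about><label>Fermi energy</label></about>
      <units>eV</units>
      <default>0.5</default>
    </number>
  </input>
</run>
"""


def describe_inputs(xml_text):
    """Return one dict per input element found in the description."""
    root = ET.fromstring(xml_text)
    controls = []
    # Every direct child of <input> is treated as one input control.
    for elem in root.findall("./input/*"):
        controls.append({
            "id": elem.get("id"),
            "type": elem.tag,
            # Fall back to the element id when no label is given.
            "label": elem.findtext("about/label", default=elem.get("id")),
            "units": elem.findtext("units", default=""),
            "default": elem.findtext("default", default=""),
        })
    return controls


def render_form(controls):
    """Render the controls as plain-text lines, standing in for a GUI."""
    return [f"{c['label']} [{c['units']}] = {c['default']}" for c in controls]


if __name__ == "__main__":
    for line in render_form(describe_inputs(TOOL_XML)):
        print(line)
```

The point of the sketch is the separation it illustrates: the description carries labels, units, and defaults as data, so the interface can be generated from it rather than coded by hand for each tool.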
The user sets parameter values and presses the Simulate button. Rappture then executes the underlying simulation code, which may run on the same local machine or be sent to a cluster or another Grid resource for computation. Once the simulation has finished, Rappture loads the results into the GUI so the user can visualize and explore them. Rappture captures provenance about the simulation run, including who ran the simulation, when it was run, what machine computed the result, and even when the tool was compiled/installed and the precise revision number from the source code control system. Rappture contains application programming interfaces (APIs) for C/C++, Fortran, MATLAB, Python, Perl, Ruby, Tcl, and Java, so researchers can add a Rappture GUI to a wide variety of existing codes and continue to work with their favorite programming language. Rappture can be used to drive simulators written with MPI and other parallelization libraries, and it can send simulations off to remote execution hosts