The Data Preservation Alliance for the Social Sciences (Data-PASS) is a partnership of six major U.S. institutions with a strong focus on archiving social science research. The partnership is supported by an award from the Library of Congress through its National Digital Information Infrastructure and Preservation Program (NDIIPP). The goal of Data-PASS is to acquire and preserve data at risk of being lost to the research community, from opinion polls, voting records, large-scale surveys, and other social science studies. This paper discusses three of the significant products that have emerged from this partnership: (1) procedures, developed by the partnership, for identifying and selecting “at-risk” digital materials; (2) the identification of “at-risk” social science data collections from individual researchers as well as private research organizations; and (3) the design and implementation of a shared catalog describing the data holdings of all partners. We conclude with some brief comments on the partners’ future plans to develop an inter-archival syndicated storage service.

Data Preservation Alliance for the Social Sciences: A Model for Collaboration [Prepared for DigCCurr 2007]

Introduction

Until recently, many private businesses and university-based researchers assumed that the data they generated were their property and that they had limited obligations to share those data with others or to ensure their preservation. Despite this notion, an international movement to archive, preserve, and share data emerged when digital data began to appear in volume. Still, we cannot say that even a majority of the digital social science research content created since the revolution in sample surveys and production of digital data has been preserved. There are a variety of understandable reasons for this lack of attention to preservation.
Some individual researchers have been reluctant to deposit their data in archives because they wanted to avoid sharing them with potential competitors. Some lacked the time or expertise to prepare the metadata required for effective sharing. And some investigators simply did not recognize the long-term value of their data. Institutional data producers may have been under contractual obligations with those who paid for data collection to protect proprietary information. And some data just fell through the cracks.

There remains a vast quantity of digital social science research content that has not been preserved and will not be without aggressive activities by data curators. This content lives on in the computers of individual researchers or of research institutions, or quite possibly in bookcases, libraries, and warehouses. If we do not take steps to preserve it, it will be lost forever, and its value to our society cannot be restored. It needs to be identified, located, assessed, acquired, and preserved.

Four major American social science data archives (the Inter-university Consortium for Political and Social Research (ICPSR), the Roper Center for Public Opinion Research, the Howard W. Odum Institute for Research in Social Science, and the Henry A. Murray Research Archive), along with the Harvard-MIT Data Center (a leader in digital library research) and the electronic records custodial division of the National Archives and Records Administration (NARA), have created the Data Preservation Alliance for the Social Sciences (Data-PASS) to ensure the long-term preservation of our holdings and of materials as yet unarchived. We seek to acquire and preserve data at risk of being lost to the research community, from opinion polls, voting records, large-scale surveys, and other social science studies. And we work together to identify, appraise, acquire, catalog, and preserve data used for social science research.
Identification and Selection

While our organizations have a history of collaboration, this official partnership has provided important benefits and taught us a great deal about the advantages of formalized collaborative relationships. Data-PASS is funded in part by an award from the U.S. Library of Congress’ National Digital Information Infrastructure and Preservation Program (NDIIPP) [2]. The NDIIPP mission is to develop a national strategy to collect, archive, and preserve digital content, especially materials created in digital format. Our project is working to ensure the long-term preservation of the vital heritage of digital material that allows our nation to understand itself, its social organization, and its policies and politics through social science research.

1 The Data-PASS project website is: http://www.icpsr.org/DATAPASS/. All of the good practices documentation developed in this project, including the identification, appraisal, and metadata practices, is available from: http://www.icpsr.org/DATAPASS/about.html. The shared catalog is available from http://vdc.hmdc.harvard.edu/dataverse/DATAPASS/.

Adopting common standards for any collaborative effort lays the groundwork for those relationships to grow and prosper. The Data-PASS partnership permits a much higher level of inter-archival cooperation, including mutually agreed-upon identification and appraisal policies. The potential volume of information that could be acquired and the need to make the most cost-effective use of limited resources have emphasized the need for selection standards. The current focus of our project is to identify the most significant digital social science data of the past seventy-five years. We start with the premise that any social science data not currently in a permanent archive are at risk of being lost.
If data are available at an alternative site and if there is confidence that availability will continue over time, the risk of loss is diminished.

An operations committee, with representatives from each partnering organization, developed common standards that are used to identify and select data for inclusion. These criteria incorporate elements of accepted archival practice to identify the most important content to preserve, together with an evaluation of the risk of losing the content should acquisition not take place. The appraisal guidelines include the significance of the data to the research community, the significance of the source and context of the data, and the uniqueness and usability of the data.

The identification and selection process is somewhat decentralized, with each archive pursuing data that best represent its content area of specialization. This decentralization allows each partner to leverage its distinct capabilities in specific kinds and sources of data. However, the information gathered regarding specific data collections is brought to the committee to determine how best to proceed. Together, we try to determine whether the data are from studies that were theoretically and/or methodologically groundbreaking. Other data collections of interest are from studies that are part of a seminal collection or tied to unrepeatable or rare events. We also determine whether the data are highly cited in the social sciences or were produced by highly cited social scientists.

As part of this process, we communicate with the producers of the data to determine their willingness to archive their data. Building and maintaining relationships between data producers and data archives is among the most important tasks an archivist has. Those who deposit their data with archives must trust that the archive will value and preserve the information they provide and ensure that the data will remain accessible over time (Crabtree and Donakowski, 2006).
Many of the data producers we encounter already know the value that each of the partnering organizations places on data preservation. Through Data-PASS, this commitment to preservation is made stronger by mutual agreements to share preservation and dissemination obligations.

Another factor influencing our interactions with individual data producers and non-profit research organizations alike is the set of data-sharing policies adopted by sponsors of research activity, such as NSF and NIH. By depositing their data in a digital archive, researchers can fulfill grant obligations that require funded research to be made available to the research community. In addition, they can avoid the administrative tasks associated with ensuring the safekeeping of the data. Depositing their data also enables researchers to demonstrate continued use of the data after the original research is completed, which can improve their prospects of securing further research funding.

Federally Funded At-Risk Materials

Some federal funding agencies stipulate that data collected using their funds should be made available and shared with other researchers. The National Science Foundation, in its Grant Proposal Guide, states that it “expects PIs to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work” (NSF, 2004). The National Institutes of Health (NIH) states in its Statement on Sharing Research Data that “data sharing is essential for expedited translation of research results into knowledge, products, and procedures to improve human health”, and it “endorses the sharing of final research data to serve these and other important scientific goals” (NIH, 2003).
In addition, any data produced under a federal contract are formally federal records and are subject to review for preservation by NARA. This federally funded research is a main focus of our partnership. One of ICPSR’s roles in this partnership is to review the National Science Foundation (NSF) database. We are also reviewing the Computer Retrieval of Information on Scie
[1] Victoria Reich. LOCKSS (lots of copies keep stuff safe). iPRES, 2006.
[2] Micah Altman, et al. A Digital Library for the Dissemination and Replication of Quantitative Social Science Research. 2001.
[3] Micah Altman, et al. Numerical Issues in Statistical Computing for the Social Scientist. 2003.
[4] Micah Altman, et al. A Proposed Standard for the Scholarly Citation of Quantitative Data. IASSIST Conference, 2008.
[5] Gary King, et al. Zelig: Everyone's Statistical Software. 2006.
[6] Ian T. Foster, et al. Globus Toolkit Version 4: Software for Service-Oriented Systems. Journal of Computer Science and Technology, 2005.