The volume of data produced by scientific studies is growing at a nearly exponential rate (Domon and Aebersold, 2006; Kiebel et al., 2006). This growth creates challenges in disseminating primary experimental results for peer review and public access while also providing the information needed to reproduce the studies and to analyze the results in their proper context. Recent mandates from various public funding agencies require that data release plans be included as a project goal. This requirement is coupled with an increased need for transparency in complex research, as evidenced by the data release policies now being implemented by peer-reviewed journals such as Molecular & Cellular Proteomics (http://mcponline.org/misc/PhiladelphiaGuidelines.dtl). This combination of good scientific citizenship and funding requirements has brought the data distribution issue into the domain of scientific information management researchers.
Most mass spectrometry-based proteomics groups choose to use one of the prominent data distribution sites, such as Tranche (Falkner JA, Andrews PC, HUPO Conference 2006. Long Beach, USA, Poster presentation), PRIDE (Martens et al., 2005), NCBI’s Peptidome (Slotta et al., 2009), Human ProteinPedia (Mathivanan et al., 2008), or PeptideAtlas (Desiere et al., 2006). These sites make sense for small or targeted data releases, but for large groups with diverse experimental approaches and myriad biological model systems (e.g. Callister et al., 2008; Kiebel et al., 2006), the choice may not be so clear. Additionally, these sites are aimed at managing and disseminating data that are associated with identifications and do not generally make all the raw data available. These raw data are particularly useful to developers of analysis tools, as well as in cases where integrating multiple data sources can improve the confidence of a result. Our goal in constructing this site is to augment these public repositories by making available entire sets of raw and processed results along with their associated metadata. Doing so requires careful consideration of the site’s design in order to render it useful to the community. Herein, we present an initial version of such a site, referred to as the Biological MS Data and Software Distribution Center, which can be visited at http://omics.pnl.gov. This site leverages vast amounts of pre-existing experimental data and metadata gathered since 2001 and stored in our purpose-built data management system, PRISM (Kiebel et al., 2006).
Design philosophy
The initial intent for the site was simply to provide local researchers with a mechanism for making large sets of experimental results available to both their collaborators and the greater scientific community. This intent was coupled with a desire to organize the data in a hierarchical structure and to present results in a way that makes them readily usable and understandable by researchers who are familiar with the field but are not necessarily experts in our particular methodologies. In addition to presenting the hierarchical metadata, the site was expected to let users download large sets of raw and processed instrument data (multiple terabytes).
Omics research at Pacific Northwest National Laboratory (PNNL) involves a number of different collaborations, many of which include bioinformatics components that require large volumes of raw data at all levels of quality to produce accurate results. This system provides one model to support the current needs of these collaborations while also providing the frameworks necessary to build more advanced capabilities. In the past, sharing the information generated by these collaborations has required shipping hard drives full of data across the country. Streamlining this aspect of our data delivery process has driven the site’s initial requirements as well as many aspects of its architecture. We currently have over 150 terabytes of raw and processed data in our archives, and these developments enable its dissemination.
[1] Lennart Martens, et al. PRIDE: The proteomics identifications database. Proteomics, 2005.
[2] Timothy D. Veenstra, et al. An accurate mass tag strategy for quantitative and high throughput proteome measurements. 2002.
[3] Richard D. Smith, et al. Proteomic Analysis of Salmonella enterica Serovar Typhimurium Isolated from RAW 264.7 Macrophages. Journal of Biological Chemistry, 2006.
[4] J. Yates, et al. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nature Biotechnology, 2001.
[5] Ron Edgar, et al. NCBI Peptidome: a new public repository for mass spectrometry peptide identifications. Nature Biotechnology, 2009.
[6] Navdeep Jaitly, et al. DAnTE: a statistical tool for quantitative analysis of -omics data. Bioinform., 2008.
[7] Navdeep Jaitly, et al. DeconMSn: a software tool for accurate parent ion monoisotopic mass determination for tandem mass spectra. Bioinform., 2008.
[8] Navdeep Jaitly, et al. Decon2LS: An open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinformatics, 2009.
[9] Nikola Tolić, et al. PRISM: A data management system for high‐throughput proteomics. Proteomics, 2006.
[10] Navdeep Jaitly, et al. VIPER: an advanced software package to support high-throughput LC-MS peptide identification. Bioinform., 2007.
[11] Jesse James Garrett. Ajax: A New Approach to Web Applications. 2007.
[12] Joshua N. Adkins, et al. Comparative Bacterial Proteomics: Analysis of the Core Genome Concept. PLoS ONE, 2008.
[13] Fred Heffron, et al. Analysis of the Salmonella typhimurium Proteome through Environmental Response toward Infectious Conditions. Molecular & Cellular Proteomics, 2006.
[14] Y. L. Ramachandra, et al. Human Proteinpedia enables sharing of human protein data. Nature Biotechnology, 2008.
[15] Eric W. Deutsch, et al. The PeptideAtlas project. Nucleic Acids Res., 2005.
[16] Ruedi Aebersold, et al. Challenges and Opportunities in Proteomics Data Analysis. Molecular & Cellular Proteomics, 2006.
[17] Joshua N. Adkins, et al. MASIC: A software program for fast quantitation and flexible visualization of chromatographic profiles from detected LC-MS(/MS) features. Comput. Biol. Chem., 2008.