Data capture in bioinformatics: requirements and experiences with Pedro

Background: The systematic capture of appropriately annotated experimental data is a prerequisite for most bioinformatics analyses. Data capture is required not only for submission of data to public repositories, but also to underpin integrated analysis, archiving, and sharing, both within laboratories and in collaborative projects. The widespread requirement to capture data means that data capture and annotation are taking place at many sites, but the small scale of the literature on tools, techniques and experiences suggests that there is work to be done to identify good practice and reduce duplication of effort.

Results: This paper reports on experience gained in the deployment of the Pedro data capture tool in a range of representative bioinformatics applications. The paper makes explicit the requirements that have recurred when capturing data in different contexts, indicates how these requirements are addressed in Pedro, and describes case studies that illustrate where the requirements have arisen in practice.

Conclusion: Data capture is a fundamental activity for bioinformatics; all biological data resources build on some form of data capture activity, and many require a blend of import, analysis and annotation. Recurring requirements in data capture suggest that model-driven architectures can be used to construct data capture infrastructures that can be rapidly configured to meet the needs of individual use cases. We have described how one such model-driven infrastructure, namely Pedro, has been deployed in representative case studies, and discussed the extent to which the model-driven approach has been effective in practice.
