Prioritization of data quality dimensions and skills requirements in genome annotation work

The rapid accumulation of genome annotations, as well as their widespread reuse in clinical and scientific practice, poses new challenges to the management of scientific data quality. This study contributes to a better understanding of scientists' perceptions of, and priorities for, data quality (DQ) and the DQ assurance skills needed in genome annotation. It was guided by a previously developed general framework for DQ assessment and by a taxonomy of DQ skills, and it aimed to define context-sensitive models of DQ criteria and skills for genome annotation. Analysis of the results revealed that genomics scientists recognize specific sets of quality criteria in the genome-annotation context. Seventeen data quality dimensions were reduced to five factor constructs, and 17 relevant skills were grouped into four factor constructs. The constructs defined by this study advance the understanding of relationships among data quality dimensions and contribute to data and information quality research. In addition, the resulting models can serve as resources for genome data curators and administrators when developing data-curation policies and designing DQ assurance strategies, processes, procedures, and infrastructure. The study's findings may also inform educators developing data quality assurance curricula and training courses.
