Curating for Quality: Ensuring Data Quality to Enable New Science

Science is built on observations. If our observational data is bad, we are building a house on sand. Some of our data banks, such as the National Climate Data Center and the National Center for Biotechnology Information, have quality measurement and maintenance practices; others do not, and we do not even know which scientific data services have quality metrics or what those metrics are.

Data quality is an assertion about data properties, typically assumed within a context defined by the collection that holds the data. The assertion is made by the creator of the data. The collection context includes both metadata that describe provenance and representation, and procedures that can parse and manipulate the data. From the perspective of users, however, data quality is defined by the data properties required for use in their scientific research. Users regard data as being of high quality when compliance with their research requirements can be demonstrated.

Digital data can accumulate rich contextual and derivative data as it is collected, analyzed, used, and reused, and planning for the management of this history requires new kinds of tools, techniques, standards, workflows, and attitudes. As science and industry recognize the need for digital curation, scientists and information professionals recognize that access to and use of data depend on trust in the accuracy and veracity of the data. In all data sets, trust and reuse depend on accessible context and metadata that make explicit the provenance, precision, and other traces of the datum and data life cycle. Poor data quality can be worse than missing data because it can waste resources and lead to faulty ideas and solutions, or, at a minimum, undermine trust in the results and implications drawn from the data. Improvement in data quality can thus have significant benefits.
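To make the relationship between creator assertions and user requirements concrete, the following is a minimal, purely illustrative sketch in Python. The dataset name, property names, and helper function are hypothetical, not drawn from any particular data service; the point is only that quality is judged by checking the creator's asserted properties against what a given research use requires.

```python
from dataclasses import dataclass, field

@dataclass
class QualityAssertion:
    """A property asserted by the data creator within the collection context."""
    property_name: str   # e.g. "temporal_resolution"
    value: str           # e.g. "hourly"
    provenance: str      # how the assertion was established

@dataclass
class DatasetRecord:
    """A dataset together with the creator's quality assertions."""
    identifier: str
    assertions: dict[str, QualityAssertion] = field(default_factory=dict)

def meets_requirements(record: DatasetRecord, requirements: dict[str, str]) -> bool:
    """Quality is relative to the user: return True only if every property the
    user requires is asserted by the creator with the required value."""
    for prop, required_value in requirements.items():
        asserted = record.assertions.get(prop)
        if asserted is None or asserted.value != required_value:
            return False
    return True

# Hypothetical example: a user needing hourly resolution and documented calibration.
record = DatasetRecord(
    identifier="station-042-temperature",
    assertions={
        "temporal_resolution": QualityAssertion(
            "temporal_resolution", "hourly", "logger configuration file"),
        "calibration": QualityAssertion(
            "calibration", "documented", "annual sensor calibration report"),
    },
)
print(meets_requirements(record, {"temporal_resolution": "hourly",
                                  "calibration": "documented"}))  # True
```

In this sketch the same record could fail a different user's check (for example, one requiring minute-level resolution), which mirrors the point above: the creator's assertions do not by themselves make data "high quality"; the judgment depends on the requirements of the intended use.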