Monitoring data quality in Kepler

Data quality is an important component of modern scientific discovery. Many scientific discovery processes consume data from a diverse array of sources such as streaming sensor networks, web services, and databases. The validity of a scientific computation's results is highly dependent on the quality of this input data. Scientific workflow systems are increasingly being used to automate scientific computations by facilitating experiment design, data capture, integration, processing, and analysis. These workflows may execute in grid or cloud environments, and if the data produced during workflow execution is deemed unusable or of low quality, execution should stop to prevent wasting these valuable resources. We propose an approach in the Kepler scientific workflow system for monitoring data quality and demonstrate its use in the oceanography and bioinformatics domains.
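To make the stop-on-low-quality idea concrete, the following is a minimal sketch, not part of the Kepler API: a hypothetical monitor that scores a batch of records against a validity rule and signals that execution should halt when the valid fraction drops below a threshold. The record fields, range check, and threshold are all illustrative assumptions.

```python
def is_valid(record):
    """Toy validity rule (assumption): a record needs a numeric
    'temp' field within a plausible sensor range."""
    temp = record.get("temp")
    return isinstance(temp, (int, float)) and -5.0 <= temp <= 40.0

def monitor_quality(records, threshold=0.8):
    """Return (quality, should_stop), where quality is the fraction
    of valid records and should_stop is True when it falls below
    the threshold."""
    if not records:
        return 1.0, False
    valid = sum(1 for r in records if is_valid(r))
    quality = valid / len(records)
    return quality, quality < threshold

# Example: one in-range reading and one sensor glitch.
quality, should_stop = monitor_quality([{"temp": 12.0}, {"temp": 999}])
# quality == 0.5, so should_stop is True at the default 0.8 threshold
```

In a real workflow system such a check would run as a component between data-producing and data-consuming steps, so a low-quality signal can abort downstream processing before grid or cloud resources are consumed.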