Managing scientific data: lessons, challenges, and opportunities

Today's scientific processes depend heavily on fast and accurate analysis of experimental data. Scientists are routinely overwhelmed by the effort needed to manage the volumes of data produced either by observing phenomena or by running sophisticated simulations. Because database systems have proven inefficient, inadequate, or insufficient for the needs of scientific applications, the scientific community typically relies on special-purpose legacy software. Compared to a general-purpose DBMS, however, such application-specific systems require more resources to maintain, and to achieve acceptable performance they often sacrifice data independence and hinder the reuse of knowledge. Scientific datasets are now growing at unprecedented rates, driven by the increasing complexity of simulated models and ever-improving instrument precision; consequently, scientists' queries are becoming more sophisticated as they try to interpret the data correctly. Dataset sizes and query complexity are likely to keep growing indefinitely, rendering legacy systems increasingly inadequate. To meet this challenge, the data management community aspires to solve scientific data management problems by carefully examining the needs of scientific applications and by developing special- or general-purpose scientific data management techniques and systems. This talk discusses the work of teams around the world, in an effort to surface the most critical requirements of such an undertaking and the technological innovations needed to satisfy them.