Towards efficient data search and subsetting of large-scale atmospheric datasets

Discovering the correct dataset efficiently is critical for effective simulations in the atmospheric sciences. Unlike text-based web documents, many large scientific datasets contain binary-encoded data that is hard to discover using popular search engines. Public data hosting services in the atmospheric sciences have grown significantly, but the ability to index and search their holdings has been limited by the metadata that each data host provides. We have developed an infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences. To support complex queries, ADDS automatically extracts and indexes fine-grained metadata. Datasets are indexed through periodic crawling of popular sites and of files requested by users. A data customization feature lets users access subsets of a large dataset. This paper describes the overall architecture and the data subsetting scheme, and presents a performance evaluation of the system.
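The idea of indexing fine-grained metadata to answer complex queries can be illustrated with a minimal sketch. This is not the ADDS implementation: the names `DatasetRecord`, `MetadataIndex`, and `boxes_intersect`, the choice of fields (variable names, time range, bounding box), and the example URLs are all assumptions made for illustration, and the bounding-box test ignores wrap-around at the antimeridian.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical fine-grained metadata extracted from one data file."""
    url: str
    variables: frozenset   # e.g. frozenset({"temperature", "pressure"})
    start: datetime        # earliest observation time in the file
    end: datetime          # latest observation time in the file
    bbox: tuple            # (min_lat, min_lon, max_lat, max_lon)


def boxes_intersect(a, b):
    """Return True if two (min_lat, min_lon, max_lat, max_lon) boxes overlap.

    Assumption: neither box crosses the antimeridian.
    """
    return (a[0] <= b[2] and b[0] <= a[2] and
            a[1] <= b[3] and b[1] <= a[3])


class MetadataIndex:
    """Toy in-memory index keyed by variable name (illustrative only)."""

    def __init__(self):
        self._by_variable = {}

    def add(self, record):
        # Register the record under every variable it contains.
        for var in record.variables:
            self._by_variable.setdefault(var, []).append(record)

    def query(self, variable, start, end, bbox):
        """Find records holding `variable` that overlap the time and area."""
        hits = []
        for rec in self._by_variable.get(variable, []):
            time_overlap = rec.start <= end and start <= rec.end
            if time_overlap and boxes_intersect(rec.bbox, bbox):
                hits.append(rec)
        return hits


if __name__ == "__main__":
    index = MetadataIndex()
    index.add(DatasetRecord(
        url="http://example.org/obs/a.bufr",  # placeholder URL
        variables=frozenset({"temperature"}),
        start=datetime(2010, 1, 1), end=datetime(2010, 1, 2),
        bbox=(30.0, -110.0, 45.0, -95.0),
    ))
    matches = index.query("temperature",
                          datetime(2010, 1, 1, 12), datetime(2010, 1, 1, 18),
                          (35.0, -105.0, 40.0, -100.0))
    for rec in matches:
        print(rec.url)
```

A query that combines a variable name, a time window, and a region returns only the files whose extracted metadata overlaps all three constraints, which is the kind of search that host-provided, file-level metadata alone often cannot answer.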
