When big data leads to lost data

For decades, scientists bemoaned the scarcity of observational data to analyze and against which to test their models. Exponential growth in data volumes from ever-cheaper environmental sensors has provided scientists with the answer to their prayers: "big data". Now scientists face a new challenge: with terabytes, petabytes, or exabytes of data at hand, stored in thousands of heterogeneous datasets, how can they find the datasets most relevant to their research interests? If they cannot find the data, they may as well never have collected it; that data is lost to them. Our research addresses this challenge, using an existing scientific archive as our test-bed. We approach the problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data. We then perform searches over the extracted metadata, returning results ranked by similarity to the query terms. We briefly describe an implementation at an ocean observatory that validates the proposed approach. We also propose performance and scalability research to explore how continued archive growth will affect our goal of interactive response at any scale.

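To make the ranked-search idea concrete, the sketch below illustrates one plausible way to score extracted dataset metadata against a query that mixes variable names with geospatial and temporal terms. It is a minimal illustration, not the paper's implementation: the record fields, the equal weighting, and the similarity functions are all assumptions chosen for brevity.

```python
import math
from dataclasses import dataclass

# Hypothetical metadata record summarizing one dataset in the archive.
# Field names are illustrative, not taken from the paper.
@dataclass
class DatasetMetadata:
    name: str
    variables: set      # e.g. {"temperature", "salinity"}
    lat: float          # centroid latitude of the dataset's spatial coverage
    lon: float          # centroid longitude
    start_year: int
    end_year: int

def score(meta, query_vars, query_lat, query_lon, query_year):
    """Blend textual, spatial, and temporal similarity into one rank score."""
    # Textual component: Jaccard overlap between query terms and variable names.
    overlap = len(query_vars & meta.variables)
    union = len(query_vars | meta.variables) or 1
    text_sim = overlap / union

    # Spatial component: decay with distance between centroids (degrees, crude).
    dist = math.hypot(meta.lat - query_lat, meta.lon - query_lon)
    spatial_sim = 1.0 / (1.0 + dist)

    # Temporal component: 1 if the query year falls inside the coverage window,
    # otherwise decay with the gap in years.
    if meta.start_year <= query_year <= meta.end_year:
        temporal_sim = 1.0
    else:
        gap = min(abs(query_year - meta.start_year), abs(query_year - meta.end_year))
        temporal_sim = 1.0 / (1.0 + gap)

    # Equal weights here; real weights would need tuning against user judgments.
    return (text_sim + spatial_sim + temporal_sim) / 3.0

def ranked_search(catalog, query_vars, lat, lon, year, k=5):
    """Return the top-k metadata records, ranked by blended similarity."""
    scored = [(score(m, query_vars, lat, lon, year), m.name) for m in catalog]
    return sorted(scored, reverse=True)[:k]

if __name__ == "__main__":
    catalog = [
        DatasetMetadata("mooring_A", {"temperature", "salinity"}, 45.0, -124.5, 2008, 2012),
        DatasetMetadata("glider_B", {"oxygen", "temperature"}, 44.2, -125.0, 2011, 2013),
        DatasetMetadata("cruise_C", {"chlorophyll"}, 30.0, -140.0, 2005, 2006),
    ]
    print(ranked_search(catalog, {"temperature"}, 45.0, -124.7, 2011))
```

In a full system the scoring would run over an index of the extracted metadata rather than a linear scan, which is where the interactive-response and scalability questions raised above come in.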