Meteor/Sopremo: An Extensible Query Language and Operator Model

Recently, quite a few query and scripting languages for MapReduce-based systems have been developed to ease the formulation of complex data analysis tasks. However, existing tools mainly provide basic operators for rather simple analyses, such as aggregating or filtering. Analytic functionality for advanced applications, such as data cleansing or information extraction, can only be embedded in user-defined functions, where the semantics is hidden from the query compiler and optimizer. In this paper, we present a language that treats application-specific functions as first-class operators, so that operator semantics can be evaluated and exploited for optimization at compile time. We present Sopremo, a semantically rich operator model, and Meteor, an extensible query language that is grounded in Sopremo. Sopremo also provides a programming framework that allows users to easily develop and integrate extensions with their respective operators and instantiations. Meteor's syntax is operator-oriented and uses a JSON-like data model to support applications that analyze semi-structured and unstructured data. Meteor queries are translated into data flow programs of operator instantiations, i.e., concrete implementations of the involved Sopremo operators. Using a real-world example, we show how operators from different applications can be combined for writing complex analytical queries.
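
To illustrate why exposing operator semantics matters at compile time, the following is a minimal Python sketch, not Sopremo's actual API. All names (`Operator`, `reads`, `writes`, `push_filters_early`) are hypothetical, and the reordering rule is a simplified assumption: a filter may move ahead of a record-at-a-time operator when it does not read any field that operator writes. An opaque user-defined function would offer no such metadata, so the plan could not be reordered safely.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

Record = Dict[str, str]


@dataclass(frozen=True)
class Operator:
    """Hypothetical operator that declares its semantics as metadata."""
    name: str
    reads: FrozenSet[str]      # fields the operator inspects
    writes: FrozenSet[str]     # fields the operator adds or modifies
    is_filter: bool            # True if it only drops records
    record_wise: bool          # True if each record is processed independently
    apply: Callable[[List[Record]], List[Record]]


def can_swap(first: Operator, second: Operator) -> bool:
    """Assumed rule: a filter may run before a record-wise operator
    if it does not depend on any field that operator produces."""
    return (second.is_filter and first.record_wise
            and not (second.reads & first.writes))


def push_filters_early(plan: List[Operator]) -> List[Operator]:
    """Repeatedly move filters in front of non-filter operators when allowed."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            if not plan[i].is_filter and can_swap(plan[i], plan[i + 1]):
                plan[i], plan[i + 1] = plan[i + 1], plan[i]
                changed = True
    return plan


def run(plan: List[Operator], records: List[Record]) -> List[Record]:
    """Execute the operators of a plan one after another."""
    for op in plan:
        records = op.apply(records)
    return records


# Toy stand-in for an expensive, application-specific operator.
annotate = Operator(
    name="annotate_sentiment",
    reads=frozenset({"text"}),
    writes=frozenset({"sentiment"}),
    is_filter=False,
    record_wise=True,
    apply=lambda rs: [
        {**r, "sentiment": "pos" if "good" in r["text"] else "neg"} for r in rs
    ],
)

# Cheap filter that only looks at the "lang" field.
keep_english = Operator(
    name="filter_lang",
    reads=frozenset({"lang"}),
    writes=frozenset(),
    is_filter=True,
    record_wise=True,
    apply=lambda rs: [r for r in rs if r["lang"] == "en"],
)

data = [
    {"text": "good service", "lang": "en"},
    {"text": "guter Service", "lang": "de"},
    {"text": "slow delivery", "lang": "en"},
]

plan = [annotate, keep_english]
optimized = push_filters_early(plan)
```

In this toy plan the filter ends up ahead of the annotation step, so the costly operator processes fewer records while the query result stays the same.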
