Jaql

This paper describes Jaql, a declarative scripting language for analyzing large semistructured datasets in parallel using Hadoop’s MapReduce framework. Jaql is currently used in IBM’s InfoSphere BigInsights [5] and Cognos Consumer Insight [9] products. Jaql’s design features are: (1) a flexible data model, (2) reusability, (3) varying levels of abstraction, and (4) scalability. Jaql’s data model is inspired by JSON and can be used to represent datasets that vary from flat, relational tables to collections of semistructured documents. A Jaql script can start without any schema and evolve over time from a partial to a rigid schema. Reusability is provided through the use of higher-order functions and by packaging related functions into modules. Most Jaql scripts work at a high level of abstraction for concise specification of logical operations (e.g., join), but Jaql’s notion of physical transparency also provides a lower level of abstraction if necessary. This allows users to pin down the evaluation plan of a script for greater control or even add new operators. The Jaql compiler automatically rewrites Jaql scripts so they can run in parallel on Hadoop. In addition to describing Jaql’s design, we present the results of scale-up experiments on Hadoop running Jaql scripts for intranet data analysis and log processing.
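As a rough illustration of two of the ideas above (schema-optional, JSON-like records and reuse through higher-order functions), here is a minimal sketch in Python. It is not Jaql syntax and does not run on Hadoop. The names `pipeline`, `keep`, and `group_sum`, and the sample log records, are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable

Record = dict[str, Any]
Stage = Callable[[Iterable[Record]], Iterable[Record]]

# JSON-like records with no declared schema: fields may be missing or nested.
logs: list[Record] = [
    {"user": "ann", "status": 200, "bytes": 512},
    {"user": "bob", "status": 500},
    {"user": "ann", "status": 200, "bytes": 128, "tags": ["mobile"]},
]


def pipeline(*stages: Stage) -> Stage:
    """Higher-order function: compose stages into one reusable stage."""
    def run(records: Iterable[Record]) -> Iterable[Record]:
        for stage in stages:
            records = stage(records)
        return records
    return run


def keep(pred: Callable[[Record], bool]) -> Stage:
    """Build a filter stage from a predicate (hypothetical helper)."""
    return lambda records: (r for r in records if pred(r))


def group_sum(key: Callable[[Record], Any],
              value: Callable[[Record], int]) -> Stage:
    """Build a group-by-and-sum stage (hypothetical helper)."""
    def stage(records: Iterable[Record]) -> list[Record]:
        totals: dict[Any, int] = defaultdict(int)
        for r in records:
            totals[key(r)] += value(r)
        return [{"key": k, "total": v} for k, v in sorted(totals.items())]
    return stage


# A logical description of the analysis; missing fields are tolerated.
bytes_per_user = pipeline(
    keep(lambda r: r.get("status") == 200),
    group_sum(lambda r: r["user"], lambda r: r.get("bytes", 0)),
)

result = list(bytes_per_user(logs))
```

The sketch runs in a single process. The paper's point is that a compiler can rewrite a logical description like this one into parallel MapReduce jobs, which this example does not attempt.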

[1] Michael Stonebraker, et al. The Case for Shared Nothing, 1985, HPTS.

[2] David J. DeWitt, et al. GAMMA - A High Performance Dataflow Database Machine, 1986, VLDB.

[3] Guy E. Blelloch, et al. NESL: A Nested Data-Parallel Language, 1992.

[4] Rick Greer, et al. Daytona and the fourth-generation language Cymbal, 1999, SIGMOD '99.

[5] Bernhard Mitschang, et al. User-Defined Table Operators: Enhancing Extensibility for ORDBMS, 1999, VLDB.

[6] Rudolf Eigenmann, et al. Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation, 2003, LCPC.

[7] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[8] Rob Pike, et al. Interpreting the data: Parallel analysis with Sawzall, 2005, Sci. Program.

[9] Matthias Nicola, et al. An XML transaction processing benchmark, 2007, SIGMOD '07.

[10] Scott Boag, et al. XQuery 1.0: An XML Query Language, 2007.

[11] Yuan Yu, et al. Dryad: distributed data-parallel programs from sequential building blocks, 2007, EuroSys '07.

[12] Ravi Kumar, et al. Pig Latin: a not-so-foreign language for data processing, 2008, SIGMOD Conference.

[13] Jingren Zhou, et al. SCOPE: easy and efficient parallel processing of massive data sets, 2008, Proc. VLDB Endow.

[14] Michael Isard, et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language, 2008, OSDI.

[15] Ling Wang, et al. XQuery Rewrite Optimization in IBM DB2 pureXML, 2008, IEEE Data Eng. Bull.

[16] Pete Wyckoff, et al. Hive - A Warehousing Solution Over a Map-Reduce Framework, 2009, Proc. VLDB Endow.

[17] Joseph M. Hellerstein, et al. MAD Skills: New Analysis Practices for Big Data, 2009, Proc. VLDB Endow.

[18] Frederick Reiss, et al. Towards a Scalable Enterprise Content Analytics Platform, 2009, IEEE Data Eng. Bull.

[19] Michele Colajanni, et al. Defending financial infrastructures through early warning systems: the intelligence cloud approach, 2009, CSIIRW '09.

[20] Peter J. Haas, et al. E = MC3: managing uncertain enterprise data in a cluster-computing environment, 2009, SIGMOD Conference.

[21] John Cieslewicz, et al. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions, 2009, Proc. VLDB Endow.

[22] Jens Dittrich, et al. iMeMex: From Search to Information Integration and Back, 2009, IEEE Data Eng. Bull.

[23] Frederick Reiss, et al. SystemT: a system for declarative information extraction, 2009, SGMD.

[24] Calvin Lin, et al. Midas for government: Integration of government spending data on Hadoop, 2010, 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010).

[25] Aart J. C. Bik, et al. Pregel: a system for large-scale graph processing, 2010, SIGMOD Conference.

[26] Craig Chambers, et al. FlumeJava: easy, efficient data-parallel pipelines, 2010, PLDI '10.

[27] Dominic Battré, et al. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing, 2010, SoCC '10.

[28] Peter J. Haas, et al. Ricardo: integrating R and Hadoop, 2010, SIGMOD Conference.

[29] Rajasekar Krishnamurthy, et al. Midas: integrating public financial data, 2010, SIGMOD Conference.

[30] Alin Deutsch, et al. ASTERIX: towards a scalable, semistructured data platform for evolving-world models, 2011, Distributed and Parallel Databases.

[31] Rares Vernica, et al. Hyracks: A flexible and extensible foundation for data-intensive computing, 2011, 2011 IEEE 27th International Conference on Data Engineering.