MAD Skills: New Analysis Practices for Big Data

As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques, and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.
