The Myria Big Data Management and Analytics System and Cloud Services

In this paper, we present an overview of the Myria stack for big data management and analytics, which we developed in the database group at the University of Washington and have been operating as a cloud service for domain scientists around the UW campus. We highlight Myria’s key design choices and innovations and report on our experience using Myria for a range of data science use cases.
