Towards unifying spreadsheets with databases for ad-hoc interactive data management at scale

We are witnessing the increasing availability of data across a spectrum of domains, which necessitates interactive, ad-hoc management and analysis of this data in order to put it to use. Unfortunately, interactive ad-hoc management of very large datasets presents a host of challenges, ranging from performance to interface usability. This thesis introduces a new research direction, the manipulation of large datasets through an interactive interface, and takes several steps in this direction. In particular, we develop DataSpread, a tool that enables users to work with arbitrarily large datasets via a direct manipulation interface. DataSpread holistically unifies spreadsheets and relational databases to leverage the benefits of both. However, this integration is not trivial, because the two paradigms differ in both architecture and ideology. We have built a prototype of DataSpread, which, in addition to motivating the underlying challenges, demonstrates the feasibility and usefulness of this holistic integration. We focus on four challenges encountered while developing DataSpread. (i) Representation: we address the challenges of flexibly representing ad-hoc spreadsheet data within a relational database; (ii) Indexing: we develop indexing data structures for supporting and maintaining access by position; (iii) Formula Computation: we introduce an asynchronous formula computation framework that addresses the challenge of ensuring consistency and interactivity at the same time; and (iv) Organization: we develop a framework to best organize data based on a workload, e.g., queries specified on the spreadsheet interface.
