Manifesto from Dagstuhl Perspectives Workshop 16151: Research Directions for Principles of Data Management

The area of Principles of Data Management (PDM) has made crucial contributions to the development of formal frameworks for understanding and managing data and knowledge. This work has involved a rich cross-fertilization between PDM and other disciplines in mathematics and computer science, including logic, complexity theory, and knowledge representation. We anticipate ongoing expansion of PDM research as the technology and applications involving data management continue to grow and evolve. In particular, the lifecycle of Big Data Analytics raises a wealth of challenges that PDM can help address. In this report we identify some of the most important research directions where the PDM community has the potential to make significant contributions. We do so from three perspectives: potential practical relevance, results already obtained, and research questions that appear surmountable in the short and medium term.

Perspectives Workshop: April 10–15, 2016, http://www.dagstuhl.de/16151

2012 ACM Subject Classification: Theory of computation → Database theory

[67]  Tova Milo,et al.  On the Complexity of Evaluating Order Queries with the Crowd , 2015, IEEE Data Eng. Bull..

[68]  Helen J. Wang,et al.  Online aggregation , 1997, SIGMOD '97.

[69]  Markus Zanker,et al.  Replication and Reproduction in Recommender Systems Research - Evidence from a Case-Study with the rrecsys Library , 2017, IEA/AIE.

[70]  Martin Hepp,et al.  The Web of Data for E-Commerce: Schema.org and GoodRelations for Researchers and Practitioners , 2015, ICWE.

[71]  Eli Upfal,et al.  The VC-Dimension of SQL Queries and Selectivity Estimation through Sampling , 2011, ECML/PKDD.

[72]  Marie-Francine Moens,et al.  Argumentation Mining: Where are we now, where do we want to be and how do we get there? , 2013, FIRE.

[73]  Wolfgang Sander Uncertainty and Fuzziness , 1994 .

[74]  Jeffrey Heer,et al.  Enterprise Data Analysis and Visualization: An Interview Study , 2012, IEEE Transactions on Visualization and Computer Graphics.

[75]  Nigel Collier,et al.  Comparison between Tagged Corpora for the Named Entity Task , 2000, ACL 2000.

[76]  Juliana Freire,et al.  Provenance and scientific workflows: challenges and opportunities , 2008, SIGMOD Conference.

[77]  Ke Yi,et al.  Towards a Worst-Case I/O-Optimal Algorithm for Acyclic Joins , 2016, PODS.

[78]  Frederick Y. Wu,et al.  Business Artifact-Centric Modeling for Real-Time Performance Monitoring , 2011, BPM.

[79]  Manik Varma,et al.  FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning , 2014, KDD.

[80]  Olivier Chapelle,et al.  Expected reciprocal rank for graded relevance , 2009, CIKM.

[81]  Judith Masthoff,et al.  Layered evaluation of interactive adaptive systems: framework and formative methods , 2010, User Modeling and User-Adapted Interaction.

[82]  Leslie G. Valiant,et al.  A bridging model for parallel computation , 1990, CACM.

[83]  Serge Abiteboul,et al.  Comparing workflow specification languages: A matter of views , 2012, TODS.

[84]  Björn Scheuermann,et al.  Bitcoin and Beyond: A Technical Survey on Decentralized Digital Currencies , 2016, IEEE Communications Surveys & Tutorials.

[85]  RONALD FAGIN,et al.  Document Spanners , 2015, J. ACM.

[86]  Martín Ugarte,et al.  Foundations of JSON Schema , 2016, WWW.

[87]  Kevin Wilkinson,et al.  Data integration flows for business intelligence , 2009, EDBT '09.

[88]  Karin Baier,et al.  The Uses Of Argument , 2016 .

[89]  Oren Etzioni,et al.  Navigating Extracted Data with Schema Discovery , 2007, WebDB.

[90]  Ulrike Sattler,et al.  A Case for Abductive Reasoning over Ontologies , 2006, OWLED.

[91]  Witold Lipski,et al.  On semantic issues connected with incomplete information databases , 1979, ACM Trans. Database Syst..

[92]  Dan Suciu,et al.  Worst-Case Optimal Algorithms for Parallel Query Processing , 2016, ICDT.

[93]  Ben Carterette The Best Published Result is Random: Sequential Testing and its Effect on Reported Effectiveness , 2015, SIGIR.

[94]  Gianmarco De Francisci Morales,et al.  SAMOA: scalable advanced massive online analysis , 2015, J. Mach. Learn. Res..

[95]  Diego Calvanese,et al.  Foundations of data-aware process analysis: a database theory perspective , 2013, PODS.

[96]  Jonas Lerman,et al.  Big Data and Its Exclusions , 2013 .

[97]  Li Chen,et al.  A user-centric evaluation framework for recommender systems , 2011, RecSys '11.

[98]  D. Id,et al.  Evaluating sense disambiguation across diverse parameter spaces , 2002 .

[99]  Alejandro Bellogín,et al.  Statistical biases in Information Retrieval metrics for recommender systems , 2017, Information Retrieval Journal.

[100]  Alistair Moffat,et al.  Principles for robust evaluation infrastructure , 2011, DESIRE '11.

[101]  Moustapha Cissé,et al.  Robust Bloom Filters for Large MultiLabel Classification Tasks , 2013, NIPS.

[102]  Toniann Pitassi,et al.  Fairness through awareness , 2011, ITCS '12.

[103]  Egor V. Kostylev,et al.  Beyond Well-designed SPARQL , 2016, ICDT.

[104]  D. Walton Argumentation Schemes for Presumptive Reasoning , 1995 .

[105]  M. Richardson On weakly ordered systems , 1946 .

[106]  Giorgio Orsi,et al.  Query Rewriting and Optimization for Ontological Databases , 2014, TODS.

[107]  Michael Carl Tschantz,et al.  Automated Experiments on Ad Privacy Settings , 2014, Proc. Priv. Enhancing Technol..

[108]  Dan Suciu,et al.  Probabilistic Databases with MarkoViews , 2012, Proc. VLDB Endow..

[109]  Eva Blomqvist,et al.  Integrating Ontology Debugging and Matching into the eXtreme Design Methodology , 2015, WOP.

[110]  Rafael Peñaloza,et al.  The limits of decidability in fuzzy description logics with general concept inclusions , 2015, Artif. Intell..

[111]  E. F. Codd,et al.  Understanding Relations (Installment #7) , 1974, FDT Bull. ACM SIGFIDET SIGMOD.

[112]  Georg Gottlob,et al.  Efficient Algorithms for Processing XPath Queries , 2002, VLDB.

[113]  Serge Abiteboul,et al.  Data Responsibly: Fairness, Neutrality and Transparency in Data Analysis , 2016, EDBT.

[114]  Leonid Libkin Certain answers as objects and knowledge , 2016, Artif. Intell..

[115]  C. J. Date Database in depth - relational theory for practitioners , 2005 .

[116]  Jennifer Widom,et al.  Towards Globally Optimal Crowdsourcing Quality Management: The Uniform Worker Setting , 2016, SIGMOD Conference.

[117]  Paul Over,et al.  Blind Men and Elephants: Six Approaches to TREC data , 1999, Information Retrieval.

[118]  Sean M. McNee,et al.  Improving recommendation lists through topic diversification , 2005, WWW '05.

[119]  M. Arenas,et al.  SQL ' s Three-Valued Logic and Certain Answers , 2015 .

[120]  Nicola Ferro,et al.  Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness , 2015, ICTIR.

[121]  Magdalena Ortiz,et al.  Closed Predicates in Description Logics: Results on Combined Complexity , 2016, AMW.

[122]  Julio Gonzalo,et al.  A general evaluation measure for document organization tasks , 2013, SIGIR.

[123]  Sergio Tessaris,et al.  Quelo: an Ontology-Driven Query Interface , 2011, Description Logics.

[124]  Noah D. Goodman The principles and practice of probabilistic programming , 2013, POPL.

[125]  John Riedl,et al.  Computing the Tag Genome , 2011 .

[126]  Tova Milo,et al.  BP-Ex: a uniform query engine for business process execution traces , 2010, EDBT '10.

[127]  Bin Wu,et al.  Wander Join: Online Aggregation via Random Walks , 2016, SIGMOD Conference.

[128]  Sean M. McNee,et al.  Being accurate is not enough: how accuracy metrics have hurt recommender systems , 2006, CHI Extended Abstracts.

[129]  Carsten Lutz,et al.  Ontology-Based Data Access , 2014, ACM Trans. Database Syst..

[130]  Marie-Francine Moens,et al.  Argumentation mining , 2011, Artificial Intelligence and Law.

[131]  Evaggelia Pitoura,et al.  DisC diversity: result diversification based on dissimilarity and coverage , 2012, Proc. VLDB Endow..

[132]  Todd L. Veldhuizen,et al.  Leapfrog Triejoin: A Simple, Worst-Case Optimal Join Algorithm , 2012, 1210.0481.

[133]  Jakub Závodný,et al.  Size Bounds for Factorised Representations of Query Results , 2015, TODS.

[134]  L. Thorne McCarty,et al.  The Representation of an Evolving System of Legal Concepts: II. Prototypes and Deformations , 1981, IJCAI.

[135]  Manfred Stede,et al.  From Argument Diagrams to Argumentation Mining in Texts: A Survey , 2013, Int. J. Cogn. Informatics Nat. Intell..

[136]  Jan Chomicki,et al.  Prioritized repairing and consistent query answering in relational databases , 2012, Annals of Mathematics and Artificial Intelligence.