Lavoisier: A DSL for increasing the level of abstraction of data selection and formatting in data mining

Abstract: The input data of a data mining algorithm must conform to a very specific tabular format. Data scientists arrange data into that format by writing long and complex scripts that perform many low-level operations, which is a time-consuming and error-prone process. To alleviate this situation, we present Lavoisier, a declarative language for data selection and formatting in a data mining context. Using Lavoisier, the size of data preparation scripts can be reduced by about 40% on average, and by up to 80% in some cases. Additionally, the accidental complexity present in state-of-the-art technologies is considerably mitigated.
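To make the kind of low-level preparation script mentioned in the abstract concrete, the following is a minimal sketch in plain Python. It is not Lavoisier syntax and is not taken from the paper. The `customers` and `orders` records, the `flatten` helper, and the `total_<category>` column names are all hypothetical, chosen only to illustrate how one-to-many data is typically joined, aggregated, and pivoted by hand into one row per entity.

```python
from collections import defaultdict

# Hypothetical relational data: one record per customer, many orders per customer.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]
orders = [
    {"customer_id": 1, "category": "books", "amount": 30.0},
    {"customer_id": 1, "category": "music", "amount": 12.5},
    {"customer_id": 1, "category": "books", "amount": 10.0},
    {"customer_id": 2, "category": "music", "amount": 7.0},
]


def flatten(customers, orders):
    """Build one row per customer with one total column per order category."""
    # Collect the distinct categories so that every row gets the same columns.
    categories = sorted({o["category"] for o in orders})

    # Aggregate order amounts per (customer, category) pair.
    totals = defaultdict(float)
    for o in orders:
        totals[(o["customer_id"], o["category"])] += o["amount"]

    # Join the aggregates back onto the customers, pivoting categories into columns.
    table = []
    for c in customers:
        row = {"id": c["id"], "name": c["name"]}
        for cat in categories:
            row[f"total_{cat}"] = totals.get((c["id"], cat), 0.0)
        table.append(row)
    return table


table = flatten(customers, orders)
```

Even for this toy case, the script has to handle column discovery, aggregation, joining, and default values for missing combinations explicitly. These are the sorts of steps a declarative language for data selection and formatting would aim to hide.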
