How Far are we from Data Mining Democratisation? A Systematic Review

Context: Data mining techniques have proven to be powerful tools for discovering insights hidden in domain data. However, these techniques demand highly specialised skills. People who want to analyse data often lack these skills, so they must rely on data scientists, which hinders data mining democratisation. Different approaches have appeared in recent years to address this issue.

Objective: Analyse the state of the art to determine how far we are from effective data mining democratisation, what has already been accomplished, and what should be done in the coming years.

Method: We performed a state-of-the-art review following a systematic and objective procedure, which included works from both academia and industry. The reviewed works were grouped into four categories. Each category was then evaluated in detail against well-defined evaluation criteria to identify its strengths and weaknesses.

Results: Around 700 works were initially considered, of which 43 were finally selected for a more in-depth analysis. Only two of the four identified categories provide effective solutions for data mining democratisation. Of these two categories, one always requires at least minimal intervention from a data scientist, whereas the other does not support all the stages of the data mining process and might exhibit accuracy problems in some contexts.

Conclusion: In all the analysed approaches, a data scientist is still required to perform some steps of the analysis process. Moreover, automated approaches that do not require data scientists for some steps exhibit problems in other quality attributes, such as accuracy. Therefore, although existing work shows some promising initial steps, we are still far from data mining democratisation.
