Making data analysis expertise broadly accessible through workflows

The demand for advanced skills in data analysis spans many areas of science, computing, and business analytics. This paper discusses how non-expert users reuse workflows that were created by experts and that represent complex data mining processes for text analytics. These include workflows for document classification, document clustering, and topic detection, all assembled from components available in well-known text analytics software libraries. The workflows expose to non-experts the expert-level knowledge of how these individual components need to be combined with data preparation and feature selection steps to make the underlying statistical learning algorithms most effective. The framework lets non-experts experiment with different combinations of data analysis processes, represented as workflows of computations that they can easily reconfigure. We report on our experience to date with having users who have limited data analytics knowledge and only basic programming skills apply these workflows to their own data.
