Dataset retrieval system based on automation of data preparation with dataset description model

Data preparation is the most labor-intensive task in the process of statistical learning. Many data mining studies skip data preparation on the assumption that qualified datasets are already available. Skipping this step can hide useful patterns in the data, which can result in poor performance and incorrect learning. Automating data preparation can solve these problems, but several issues must be considered: flexible expression of requirements according to the purpose of the learning model, accessibility of data sources, and performance degradation caused by automation. In this paper, we propose a dataset description model that can express the requirements for data processing, and a dataset retrieval system based on automated data preparation. The proposed system provides good-quality datasets for statistical learning applications using data preparation methods such as data acquisition, refinement, and organization. In the experiment, we demonstrate that the proposed system incurs no performance loss compared with existing manual systems. Moreover, the quality of the datasets is also improved by using the proposed system.
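
To make the idea of a dataset description driving acquisition, refinement, and organization more concrete, the following is a minimal Python sketch. It is not the model or system proposed in the paper: the class `DatasetDescription`, its fields, and the functions `acquire`, `refine`, `organize`, and `retrieve` are hypothetical names chosen for illustration, and the data sources are assumed to be simple in-memory lists of records.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


@dataclass
class DatasetDescription:
    """Hypothetical description of what a learning task needs from raw data."""
    source: str                 # name of the data source to read from
    features: List[str]         # columns used as model inputs
    label: str                  # column used as the prediction target
    drop_missing: bool = True   # refinement rule: discard incomplete records
    train_ratio: float = 0.8    # organization rule: fraction used for training


def acquire(desc: DatasetDescription,
            sources: Dict[str, List[Record]]) -> List[Record]:
    """Acquisition: fetch the raw records from the named source."""
    return list(sources[desc.source])


def refine(desc: DatasetDescription, records: List[Record]) -> List[Record]:
    """Refinement: keep only the requested columns, then drop incomplete rows."""
    wanted = desc.features + [desc.label]
    projected = [{key: r.get(key) for key in wanted} for r in records]
    if desc.drop_missing:
        projected = [r for r in projected
                     if all(value is not None for value in r.values())]
    return projected


def organize(desc: DatasetDescription,
             records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Organization: split the refined records into training and test sets."""
    cut = int(len(records) * desc.train_ratio)
    return records[:cut], records[cut:]


def retrieve(desc: DatasetDescription,
             sources: Dict[str, List[Record]]) -> Tuple[List[Record], List[Record]]:
    """Run the three preparation steps in order for one description."""
    return organize(desc, refine(desc, acquire(desc, sources)))
```

In this sketch, a caller would state its requirements once in a `DatasetDescription` and call `retrieve` to obtain a prepared training and test set, rather than writing the preparation steps by hand for each learning task.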
