Data preparation for KDD through automatic reasoning based on description logic

Abstract: Without data preparation, data mining algorithms cannot operate on data within the knowledge discovery in databases (KDD) process. The success of later KDD phases depends largely on the data preparation stage. Mechanisms that prepare data automatically save considerable time and resources, which then become available for later, less automatable stages such as results interpretation. We propose a general-purpose mechanism, applicable to multiple domains, for improving the data preparation phase of the KDD process. The mechanism is based on description logic: it processes input data and automatically converts it to a format, with a known syntax, that is suitable for applying different data preparation techniques. Taking a generic UML2 data model as a reference, it checks whether a given XML data source can be transformed and modelled as a subsumption or an instance of that UML2 model. In doing so, it automatically identifies a consistent, unambiguous and finite set of XSLT transformations that prepare the data for data mining, removing the need to spend resources on preliminary preparation and formatting. We applied the proposed mechanism to structurally complex data from four different domains and, to test its validity, applied data mining techniques to extract knowledge from the prepared data. The sound results obtained across these domains confirm that the proposal is applicable to any XML data source, and that it is correct, computationally efficient and saves time during the data preparation phase.
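The two steps named in the abstract, checking a source against a reference model and then transforming it, can be illustrated with a minimal sketch. This is not the authors' mechanism: it uses no description logic reasoner and no XSLT. A plain recursive structure check stands in for the subsumption/instance test, and a tag-renaming function stands in for an XSLT transformation. The reference model, the sample XML and all names (`REFERENCE_MODEL`, `conforms`, `rename_tags`, `MAPPING`) are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference model (assumption, not from the paper):
# element name -> set of element names permitted as direct children.
REFERENCE_MODEL = {
    "dataset": {"record"},
    "record": {"attribute"},
    "attribute": set(),
}


def conforms(element, model):
    """Toy stand-in for the instance check: True if the element and all of
    its descendants use only names and nestings allowed by the model."""
    if element.tag not in model:
        return False
    allowed = model[element.tag]
    return all(
        child.tag in allowed and conforms(child, model) for child in element
    )


def rename_tags(element, mapping):
    """Toy stand-in for an XSLT step: return a copy of the tree with
    element names rewritten according to mapping."""
    new = ET.Element(mapping.get(element.tag, element.tag), dict(element.attrib))
    new.text = element.text
    new.tail = element.tail
    for child in element:
        new.append(rename_tags(child, mapping))
    return new


# Hypothetical source document and vocabulary mapping.
SOURCE = "<patients><patient><value name='age'>42</value></patient></patients>"
MAPPING = {"patients": "dataset", "patient": "record", "value": "attribute"}

source_root = ET.fromstring(SOURCE)
prepared_root = rename_tags(source_root, MAPPING)

source_ok = conforms(source_root, REFERENCE_MODEL)      # False: foreign vocabulary
prepared_ok = conforms(prepared_root, REFERENCE_MODEL)  # True after renaming
prepared_xml = ET.tostring(prepared_root, encoding="unicode")
```

In this sketch the mapping is supplied by hand; the mechanism described in the abstract instead derives the set of transformations automatically through reasoning over the UML2 reference model.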
