Big Data Analysis and Reporting with Decision Tree Induction

Data mining methods are widely used across many disciplines to identify patterns, rules, or associations in huge volumes of data. In the past, black-box methods such as neural nets and support vector machines were heavily used in technical domains, while methods with explanation capability were preferred in medical domains. Now that more work has been done on the advantages and disadvantages of the methods, data mining methods with explanation capability are also used in technical domains. Decision tree induction, such as C4.5, is the most preferred method since it works well on average regardless of the data set being used. It can learn a decision tree without heavy user interaction, whereas with neural nets a lot of time is spent on training the net. Cross-validation methods can be applied to decision tree induction; they help ensure that the calculated error rate comes close to the true error rate. The error rate and the particular goodness measures described in this paper are quantitative measures that help in understanding the quality of the model. The data collection problem, including noise in the data, has to be considered. Specialized accuracy measures and proper visualization methods help to understand this problem. Since decision tree induction is a supervised method, the associated data labels constitute another problem, and re-labeling should be considered after the model has been learnt. This paper also discusses how to fit the learnt model to the expert's knowledge. The problem of comparing two decision trees with respect to their explanation power is discussed. Finally, we summarize our methodology for the interpretation of decision trees.

Key-Words: Big Data Analysis, Reporting and Visualization, Decision Tree Induction, Comparison of Decision Trees, Classification, Similarity Measure
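
The cross-validated error rate mentioned above can be illustrated with a minimal sketch. It is not the paper's implementation: instead of C4.5 it uses a one-level decision tree (a decision stump) so that the example stays self-contained, and the names `train_stump` and `cross_val_error`, the toy data, and the choice of k = 5 are all assumptions made for illustration.

```python
import random


def train_stump(rows, labels):
    """Fit a one-level decision tree (stump), a stand-in for C4.5 in this sketch.

    Tries every feature and every observed value as a threshold and keeps
    the split with the fewest training errors.
    """
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # Majority label on each side (left is never empty by construction).
            l_lab = max(sorted(set(left)), key=left.count)
            r_lab = max(sorted(set(right)), key=right.count) if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda r: l_lab if r[f] <= t else r_lab


def cross_val_error(rows, labels, k=5, seed=0):
    """Estimate the error rate by k-fold cross-validation.

    Every sample is held out exactly once; the returned value is the
    fraction of held-out samples that were misclassified.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    wrong = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        model = train_stump([rows[i] for i in train_idx],
                            [labels[i] for i in train_idx])
        wrong += sum(model(rows[i]) != labels[i] for i in test_idx)
    return wrong / len(rows)


# Hypothetical toy data: one numeric attribute, two classes.
toy_rows = [[float(x)] for x in range(20)]
toy_labels = [0 if x < 10 else 1 for x in range(20)]
print("cross-validated error rate:", cross_val_error(toy_rows, toy_labels))
```

Because each sample is classified only by a model that never saw it during training, the resulting estimate is less optimistic than the error measured on the training data itself.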
