Inductive data mining based on genetic programming: Automatic generation of decision trees from data for process historical data analysis

An inductive data mining algorithm based on genetic programming, GPForest, is introduced for the automatic construction of decision trees and applied to the analysis of process historical data. GPForest not only outperforms traditional decision tree generation methods, which rely on a greedy search strategy and therefore necessarily miss regions of the search space, but, more importantly, generates multiple trees in each experimental run. In addition, varying the initial parameter values in new experiments yields further decision trees. From the multiple decision trees generated, those with high fitness values are selected to form a decision forest. For prediction, the decision forest is used instead of a single tree, and a voting strategy combines the predictions of all decision trees in the forest to produce the final prediction. In comparison with decision tree methods in the literature, GPForest is shown to give much improved performance.
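The forest-selection and voting steps described above can be illustrated with a minimal sketch. It is not the GPForest implementation: the abstract does not specify the fitness threshold or the exact voting rule, so a fixed cut-off and a simple majority vote are assumed here. The names `select_forest`, `forest_predict`, the `min_fitness` parameter, and the toy trees and process variables are all hypothetical.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence

# Assumption: a "tree" is any callable that maps a sample (a dict of
# process variables) to a class label. Real GP-evolved trees would be
# richer objects; this is only a stand-in for illustration.
Tree = Callable[[dict], Hashable]


def select_forest(trees: Sequence[Tree],
                  fitnesses: Sequence[float],
                  min_fitness: float) -> List[Tree]:
    """Keep only the trees whose fitness reaches the (assumed) threshold."""
    return [tree for tree, fit in zip(trees, fitnesses) if fit >= min_fitness]


def forest_predict(forest: Sequence[Tree], sample: dict) -> Hashable:
    """Combine the predictions of all trees by simple majority vote.

    Ties are resolved in favour of the label that was voted for first,
    which is how Counter.most_common orders equal counts.
    """
    if not forest:
        raise ValueError("the forest contains no trees")
    votes = Counter(tree(sample) for tree in forest)
    return votes.most_common(1)[0][0]


# Toy candidate trees, standing in for trees produced over several runs.
trees = [
    lambda s: "fault" if s["temp"] > 80 else "normal",
    lambda s: "fault" if s["pressure"] > 5 else "normal",
    lambda s: "fault" if s["temp"] > 90 else "normal",
    lambda s: "normal",
]
fitnesses = [0.90, 0.85, 0.80, 0.40]  # made-up fitness values

forest = select_forest(trees, fitnesses, min_fitness=0.75)
prediction = forest_predict(forest, {"temp": 85, "pressure": 6})
print(len(forest), prediction)
```

In this toy setting the low-fitness tree is excluded from the forest, and the remaining trees vote on each sample. A weighted vote (for example, weighting each tree by its fitness) would be an equally plausible reading of "voting strategy".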
