Data Preprocessing and Intelligent Data Analysis

This paper first provides an overview of data preprocessing, focusing on the problems of real-world data. These are primarily problems that must be carefully understood and solved before any data analysis process can start. The paper discusses in detail the two main reasons for performing data preprocessing: (i) problems with the data and (ii) preparation for data analysis. It then details the data preprocessing techniques that address each of these objectives; a total of 14 techniques are discussed. Two examples of data preprocessing applications follow, drawn from two of the most data-rich domains: semiconductor manufacturing and aerospace, where large amounts of fairly reliable data are available. The paper closes with future directions and some challenges.
