Intelligent Statistical Data Mining with Information Complexity and Genetic Algorithms

This paper develops a computationally feasible intelligent data mining and knowledge discovery technique that addresses the potentially daunting statistical and combinatorial problems presented by subset regression models. Our approach integrates novel statistical modelling procedures based on an information-theoretic measure of complexity. We form a three-way hybrid of information measures of complexity, multiple regression models, and genetic algorithms (GAs). We demonstrate the approach on a simulated example and on a real data set to illustrate its versatility and utility.
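To make the three-way hybrid concrete, the following is a minimal sketch, not the paper's implementation, of a GA searching over regression subsets with an information-complexity score as the fitness function. The abstract does not state the criterion, so the score used here is an assumption: minus twice the Gaussian log-likelihood plus twice the first-order complexity C1 of the estimated inverse Fisher information matrix, one commonly cited form of such a criterion. The function names (`c1_complexity`, `icomp_score`, `ga_subset_search`), the GA operators (tournament selection, uniform crossover, bit-flip mutation, elitism), and all default parameters are likewise hypothetical choices for illustration.

```python
import numpy as np

def c1_complexity(cov):
    """C1(cov) = (s/2) log(tr(cov)/s) - (1/2) log det(cov); zero when cov is a multiple of I."""
    s = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * s * np.log(np.trace(cov) / s) - 0.5 * logdet

def icomp_score(X, y, mask):
    """Assumed fitness: lack of fit plus a complexity penalty (lower is better)."""
    n = len(y)
    cols = np.flatnonzero(mask)
    Xs = np.column_stack([np.ones(n), X[:, cols]])   # intercept is always included
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    sigma2 = resid @ resid / n                        # maximum likelihood variance estimate
    neg2loglik = n * np.log(2 * np.pi) + n * np.log(sigma2) + n
    q = Xs.shape[1]
    # Estimated inverse Fisher information for (beta, sigma^2) in the Gaussian linear model.
    inv_fisher = np.zeros((q + 1, q + 1))
    inv_fisher[:q, :q] = sigma2 * np.linalg.inv(Xs.T @ Xs)
    inv_fisher[q, q] = 2.0 * sigma2 ** 2 / n
    return neg2loglik + 2.0 * c1_complexity(inv_fisher)

def ga_subset_search(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve binary inclusion masks; return the best mask evaluated and its score."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5             # random initial chromosomes
    best_mask, best_score = None, np.inf
    for _ in range(generations):
        scores = np.array([icomp_score(X, y, m) for m in pop])
        i = scores.argmin()
        if scores[i] < best_score:
            best_mask, best_score = pop[i].copy(), scores[i]
        # Binary tournament selection: the lower score of two random candidates wins.
        a = rng.integers(pop_size, size=pop_size)
        b = rng.integers(pop_size, size=pop_size)
        parents = np.where((scores[a] < scores[b])[:, None], pop[a], pop[b])
        # Uniform crossover between each parent and its neighbour in the pool.
        mates = np.roll(parents, 1, axis=0)
        cross = rng.random((pop_size, p)) < 0.5
        children = np.where(cross, parents, mates)
        # Bit-flip mutation, then elitism: carry the best mask seen so far forward.
        pop = children ^ (rng.random((pop_size, p)) < p_mut)
        pop[0] = best_mask
    return best_mask, best_score
```

Calling `ga_subset_search(X, y)` on an `n × p` predictor matrix returns a boolean mask over the `p` candidate variables. With `p` predictors there are `2^p` candidate subsets, so exhaustive enumeration becomes impractical as `p` grows; the GA evaluates only `pop_size × generations` of them.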
