Machine Learning Based on Attribute Interactions

Two attributes $A$ and $B$ are said to interact when it helps to observe the values of both attributes together. This is an example of a $2$-way interaction. In general, a group of attributes ${\cal X}$ is involved in a $k$-way interaction when we cannot reconstruct their relationship merely from $\ell$-way interactions, $\ell < k$. These two definitions formalize the notion of an interaction in a nutshell.

An additional notion is that of context, which we interpret as just another attribute. There are two ways of considering context. Context can be something that specifies our focus: we may examine interactions only in a given context, only for the instances that are in the context. Alternatively, context can be something that we are interested in: if we seek to predict weather, only the interactions involving the weather will be interesting to us. This is especially relevant for classification: we only want to examine the interactions involving the labelled class attribute and other attributes (unless there are missing or uncertain attribute values).

But the definitions are not complete. We need to specify the model that assumes the interaction: how do we represent the pattern of co-appearance of several attributes? We also need to specify a model that does not assume the interaction: how do we reconstruct the pattern of co-appearance of several attributes without actually observing them all simultaneously? We need a loss function that measures how good a particular model is, with respect to another model or with respect to the data. We need an algorithm that builds both models from the data. Finally, we need the data itself in order to assess whether it supports the hypothesis of interaction.

The present work shows that mutual information, information gain, correlation, attribute importance, association and many other concepts are all merely special cases of the above principle. Furthermore, the analysis of interactions generalizes the analysis of variance, variable clustering, structure learning of Bayesian networks, and several other problems. There is an intriguing history of reinvention on the topic of interactions in the area of information theory.

In our work, we focus on models founded on probability theory, and employ entropy and Kullback-Leibler divergence as our loss functions. Generally, whether an interaction exists, and to what extent, depends on the kind of models we are working with. McGill's interaction information in information theory, for example, is based on Kullback-Leibler divergence as the loss function and on non-normalized Kirkwood superposition approximation models. Pearson's correlation coefficient is based on the proportion of explained standard deviation as the loss function, and on the multivariate Gaussian model. Most applications of mutual information are based on Kullback-Leibler divergence and the multinomial model.

When there is a limited amount of data, it becomes unclear what model should be used to interpret it. Even if we fix the family of models, we remain uncertain about the best choice of a model within the family. In all, uncertainty pervades the choice of the model. The underlying idea of Bayesian statistics is that the uncertainty about the model is to be handled in the same way as the uncertainty about the correct prediction in nondeterministic domains.
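To make these quantities concrete, consider discrete attributes with $H$ denoting Shannon entropy (a brief worked illustration; the $3$-way sign convention shown here takes $I(A;B;C) = I(A;B|C) - I(A;B)$, and the opposite sign also appears in the literature). The $2$-way interaction between $A$ and $B$ is quantified by mutual information, the Kullback-Leibler divergence between the joint distribution, the model assuming the interaction, and the product of marginals, the model that does not:

$I(A;B) = H(A) + H(B) - H(A,B) = \sum_{a,b} p(a,b) \log \frac{p(a,b)}{p(a)\,p(b)}$

The $3$-way interaction information can be written purely in terms of entropies,

$I(A;B;C) = H(A,B) + H(A,C) + H(B,C) - H(A) - H(B) - H(C) - H(A,B,C)$,

which coincides with the divergence $\sum_{a,b,c} p(a,b,c) \log [\, p(a,b,c) / \hat{p}(a,b,c) \,]$ from the Kirkwood superposition approximation $\hat{p}(a,b,c) = p(a,b)\,p(a,c)\,p(b,c) / (p(a)\,p(b)\,p(c))$, the model built solely from $2$-way marginals. Because $\hat{p}$ need not sum to one, this quantity, unlike a proper Kullback-Leibler divergence, can be negative.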
The uncertainty, however, implies that we can know neither whether an interaction exists, nor how important it is, with complete certainty. We propose a Bayesian approach to performing significance tests: an interaction is significant if it is very unlikely, among all the foreseeable posterior models, that a model assuming the interaction would suffer a greater loss than a model not assuming it, even if the interaction truly exists. We also propose Bayesian confidence intervals to assess the probability distribution of the expected loss incurred by assuming that an interaction does not exist. We compare significance tests based on permutations, bootstrapping, cross-validation, Bayesian statistics and asymptotic theory, and find that they often disagree. It is important, therefore, to understand the assumptions that underlie these tests.

Interactions are a natural way of understanding the regularities in the data. We propose interaction analysis, a methodology for analyzing the data. It has a long history, but our novel contribution is a series of diagrams that illustrate the discovered interactions in the data: information graphs, interaction graphs and dendrograms. We use interactions to identify concept drift and the ignorability of missing data, and to cluster attribute values and build taxonomies automatically.

When we say that there is an interaction, we still need to explain what it looks like. Generally, an interaction can be explained by inferring a higher-order construct, and for that purpose we provide visualizations for several models that allow for interactions. We give a probabilistic account of rule inference: a rule can be interpreted as a constructed attribute. We describe interactions between individual attribute values and other attributes: this can help us break complex attributes down into simpler components. We also provide an approach to handling the curse of dimensionality: we dynamically maintain a structure of attributes as individual attributes enter the model one by one.

We conclude by presenting two practical algorithms: an efficient heuristic for selecting attributes within the naive Bayesian classifier, and a complete approach to prediction with interaction models, the Kikuchi-Bayes model. Kikuchi-Bayes combines Bayesian model averaging, a parsimonious prior, and a search for the interactions that determine the model. Kikuchi-Bayes outperforms popular machine learning methods, such as classification trees, logistic regression and the naive Bayesian classifier, and sometimes even support vector machines. Unlike those methods, however, Kikuchi-Bayes models are highly interpretable and can easily be visualized as interaction graphs.
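The Bayesian significance test above can be sketched in a few lines for the simplest case of a $2$-way interaction between two discrete attributes. This is a minimal illustration under stated assumptions, not the exact procedure of the thesis: the function names are ours, the posterior over the joint multinomial is a symmetric Dirichlet, the model assuming the interaction is taken to be the posterior-mean joint table, and the model not assuming it is the product of that table's marginals.

import numpy as np

def kl(p, q):
    # Kullback-Leibler divergence D(p || q) between two probability tables.
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / q[nz])).sum())

def interaction_significance(counts, prior=1.0, samples=10000, seed=None):
    # Bayesian significance test for a 2-way interaction, given the
    # contingency table of counts for two discrete attributes.
    # Each joint table drawn from the Dirichlet posterior plays the role
    # of a "foreseeable posterior model": we record how often the model
    # assuming the interaction suffers the greater loss against it.
    rng = np.random.default_rng(seed)
    alpha = np.asarray(counts, dtype=float) + prior    # Dirichlet posterior
    mean = alpha / alpha.sum()                         # interaction model
    indep = mean.sum(axis=1, keepdims=True) * mean.sum(axis=0, keepdims=True)
    draws = rng.dirichlet(alpha.ravel(), size=samples)
    worse = [kl(p, mean) > kl(p, indep)
             for p in draws.reshape(samples, *alpha.shape)]
    return np.mean(worse)   # P(interaction model loses); small => significant

# A strongly interacting (XOR-like) table: the posterior probability that
# assuming the interaction hurts should be close to zero.
print(interaction_significance(np.array([[40, 5], [5, 40]]), seed=0))

The same recipe extends to higher-order interactions by replacing the independence model with the corresponding approximation built from lower-order marginals, e.g. the Kirkwood superposition approximation for the $3$-way case.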
