Spanned patterns for the logical analysis of data

In a finite dataset of positive and negative observations represented as real-valued n-vectors, a positive (negative) pattern is an interval in R^n that contains sufficiently many positive (negative) observations and sufficiently few negative (positive) ones. A pattern is spanned if it does not properly include any other interval containing the same set of observations. Although large collections of spanned patterns can provide highly accurate classification models within the framework of the Logical Analysis of Data, no efficient method for generating them is currently known. In this paper we propose an incrementally polynomial time algorithm for generating all spanned patterns in a dataset, whose running time is linear in the size of the output; the algorithm closely resembles the Blake and Quine consensus method for finding the prime implicants of Boolean functions. The efficiency of the proposed algorithm is tested on various publicly available datasets. In the last part of the paper, we present the results of a series of computational experiments that show the high degree of robustness of spanned patterns.
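The definitions above can be made concrete with a small sketch. It is not the algorithm of the paper, only an illustration under stated assumptions: intervals are closed axis-parallel boxes stored as a pair of corner points, "sufficiently many" and "sufficiently few" are modelled by the hypothetical thresholds `min_pos` and `max_neg`, and the consensus-like step is assumed to be the smallest box containing two given boxes. All function names are invented for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner), both ends inclusive


def span(points: Sequence[Point]) -> Box:
    """Smallest closed box containing all the given points."""
    lower = tuple(min(coord) for coord in zip(*points))
    upper = tuple(max(coord) for coord in zip(*points))
    return lower, upper


def covers(box: Box, p: Point) -> bool:
    """True if point p lies inside the box on every coordinate."""
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, p, upper))


def coverage(box: Box, points: Sequence[Point]) -> List[Point]:
    """Observations from `points` that fall inside the box."""
    return [p for p in points if covers(box, p)]


def is_spanned(box: Box, observations: Sequence[Point]) -> bool:
    """A box is spanned if it equals the span of the observations it covers,
    i.e. no strictly smaller box contains the same set of observations."""
    covered = coverage(box, observations)
    return bool(covered) and span(covered) == box


def is_positive_pattern(box: Box, positives: Sequence[Point],
                        negatives: Sequence[Point],
                        min_pos: int = 1, max_neg: int = 0) -> bool:
    """Threshold test; min_pos and max_neg are assumed parameters."""
    return (len(coverage(box, positives)) >= min_pos
            and len(coverage(box, negatives)) <= max_neg)


def consensus(a: Box, b: Box) -> Box:
    """Assumed consensus-like step: the smallest box containing both boxes."""
    return span([a[0], a[1], b[0], b[1]])


positives = [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
negatives = [(5.0, 5.0)]
observations = positives + negatives

tight = span(positives[:2])           # box spanned by the first two positives
loose = ((0.0, 0.0), (2.5, 3.5))      # covers the same two points, but is larger
merged = consensus(span([positives[0]]), span([positives[2]]))
```

Here `tight` is spanned, while `loose` is not, because it properly includes `tight` and covers the same observations. Repeatedly applying a step like `consensus` to boxes already found, and keeping only the results that still pass the pattern test, is the general flavour of a consensus-type generation scheme; the details of the paper's procedure may differ.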
