Maximizing Agreement with a Classification by Bounded or Unbounded Number of Associated Words

We study the efficient discovery of word-association patterns, each defined by a sequence of strings and a proximity gap, from a collection of texts with binary labels. We present an algorithm that finds all d-string, k-proximity word-association patterns that maximize agreement with the labels. If the texts are uniformly random strings, it runs in expected time O(k^{d-1} n log^{d+1} n) and uses O(k^{d-1} n) space, where n is the total length of the texts. We also show that the problem of finding a best word-association pattern with arbitrarily many strings is MAX SNP-hard.
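To make the objective concrete, the following is a minimal brute-force sketch of how the agreement of a single pattern with the labels could be evaluated. It is not the algorithm of the paper, and it does not attempt to reach the stated complexity bounds. The matching rule is an assumption made for illustration: a pattern matches a text if its strings occur in order, with the start positions of consecutive strings between 0 and k apart. The function names `occurrences`, `matches`, and `agreement` are hypothetical.

```python
def occurrences(text, word):
    """Return every start position of word in text (overlaps allowed)."""
    positions = []
    start = text.find(word)
    while start != -1:
        positions.append(start)
        start = text.find(word, start + 1)
    return positions


def matches(text, words, k):
    """Check whether the pattern (words, k) matches text.

    Assumed rule (not taken from the source): the words occur in order and
    the start positions of consecutive words differ by between 0 and k.
    """
    # Start positions at which the prefix of the pattern can end so far.
    reachable = occurrences(text, words[0])
    for word in words[1:]:
        reachable = [
            p for p in occurrences(text, word)
            if any(0 <= p - q <= k for q in reachable)
        ]
        if not reachable:
            return False
    return bool(reachable)


def agreement(texts, labels, words, k):
    """Count texts whose match result equals their binary label."""
    return sum(
        matches(text, words, k) == bool(label)
        for text, label in zip(texts, labels)
    )
```

For example, `agreement(["abcabc", "xyz", "abxxxxca"], [1, 0, 1], ["ab", "ca"], 2)` counts the first two texts as agreeing and the third as disagreeing, because in the third text the two strings are more than 2 positions apart. A maximum-agreement search would compare this count across all candidate patterns.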
