Sparse Substring Pattern Set Discovery Using Linear Programming Boosting

In this paper, we consider the problem of finding a small set of substring patterns that classifies a given set of documents well. We formulate the problem as a 1-norm soft margin optimization problem in which each dimension corresponds to a substring pattern. We then solve this problem using LPBoost together with an optimal substring discovery algorithm. Since the problem is a linear program, the resulting solution is likely to be sparse, which is useful for feature selection. We evaluate the proposed method on real data such as movie reviews.
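To make the role of the substring discovery step concrete, the following is a minimal sketch of the kind of subproblem a boosting method such as LPBoost poses at each round: given weights over the documents, find the substring whose presence/absence classifier has the largest weighted edge. This is an illustrative brute-force stand-in, not the paper's algorithm. The function names, the hypothesis form h_p(x) = +1 if p occurs in x and -1 otherwise, the `max_len` cap, and the tie-breaking rule are all assumptions made for the example.

```python
from typing import Sequence, Set, Tuple


def substrings(doc: str, max_len: int) -> Set[str]:
    """All distinct non-empty substrings of doc with at most max_len characters."""
    return {
        doc[i:j]
        for i in range(len(doc))
        for j in range(i + 1, min(len(doc), i + max_len) + 1)
    }


def best_substring(
    docs: Sequence[str],
    labels: Sequence[int],      # +1 / -1 class labels
    weights: Sequence[float],   # non-negative document weights (assumed to sum to 1)
    max_len: int = 5,           # hypothetical cap to keep the brute force small
) -> Tuple[str, float]:
    """Return the pattern p maximizing sum_i w_i * y_i * h_p(x_i), and that edge.

    Assumed hypothesis: h_p(x) = +1 if p occurs in x, else -1.
    Ties are broken toward longer, then lexicographically larger patterns.
    """
    candidates: Set[str] = set()
    for doc in docs:
        candidates |= substrings(doc, max_len)

    def edge(p: str) -> float:
        return sum(
            w * y * (1 if p in d else -1)
            for d, y, w in zip(docs, labels, weights)
        )

    best = max(candidates, key=lambda p: (edge(p), len(p), p))
    return best, edge(best)


if __name__ == "__main__":
    docs = ["good movie", "really good", "bad movie", "so bad"]
    labels = [1, 1, -1, -1]
    weights = [0.25, 0.25, 0.25, 0.25]
    print(best_substring(docs, labels, weights))
```

In a column-generation scheme, a routine like this would be called repeatedly with updated weights, and each returned pattern would become a new column of the linear program. The brute-force enumeration here only suits tiny inputs; it requires at least one non-empty document and is meant purely to show what "optimal substring" means for a given weighting.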
