Good edit similarity learning by loss minimization

Similarity functions are a fundamental component of many learning algorithms. When dealing with string or tree-structured data, measures based on the edit distance are widely used, and a few methods exist for learning them from data. However, these methods offer no theoretical guarantee on the generalization ability or discriminative power of the learned similarities. In this paper, we propose an approach to edit similarity learning based on loss minimization, called GESL. It is driven by the notion of (ϵ,γ,τ)-goodness, a framework that relates the properties of a similarity function to its performance in classification. Using the notion of uniform stability, we derive generalization guarantees that hold for a large class of loss functions. We also provide experimental results on two real-world datasets showing that edit similarities learned with GESL induce more accurate and sparser classifiers than other (standard or learned) edit similarities.
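To make the setting concrete, the sketch below shows how an edit-distance-based similarity can feed a linear classifier in the spirit of the (ϵ,γ,τ)-goodness framework: strings are represented by their similarities to a set of landmark strings, on which a sparse linear classifier would then be trained. This is an illustrative sketch only; the function names (`edit_distance`, `edit_similarity`, `similarity_features`), the unit edit costs, and the `exp(-d)` mapping from distance to similarity are assumptions for exposition, not the exact formulation used by GESL, which learns the edit costs themselves by loss minimization.

```python
import math

def edit_distance(x, y):
    # Standard Levenshtein dynamic program with unit costs for
    # insertion, deletion, and substitution (an assumed cost setting;
    # GESL would learn these costs instead of fixing them).
    m, n = len(x), len(y)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + sub)  # substitution/match
        prev = curr
    return prev[n]

def edit_similarity(x, y):
    # Map the distance to a similarity in (0, 1]; exp(-d) is one
    # common (assumed) choice, so identical strings get similarity 1.
    return math.exp(-edit_distance(x, y))

def similarity_features(x, landmarks):
    # Similarity-based feature map: represent x by its similarities
    # to a set of landmark strings. A sparse (e.g. L1-regularized)
    # linear classifier can then be learned on these features.
    return [edit_similarity(x, l) for l in landmarks]
```

For example, `similarity_features("abc", ["abd", "xyz"])` yields a two-dimensional representation in which the coordinate for the closer landmark `"abd"` is larger than the one for `"xyz"`.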
