Extended Subtree: A New Similarity Function for Tree Structured Data

Although several distance and similarity functions for trees have been introduced, their performance is not always satisfactory in applications ranging from document clustering to natural language processing. This research proposes a new similarity function for trees, namely Extended Subtree (EST), which introduces a new subtree mapping. EST generalizes the edit-based distances by providing new rules for subtree mapping. Further, the new approach seeks to resolve the problems and limitations of previous approaches. Extensive evaluation frameworks are developed to compare the performance of the new approach against previous proposals. Clustering and classification case studies utilizing three real-world and one synthetic labeled data sets are performed to provide an unbiased evaluation in which different distance functions are investigated. The experimental results demonstrate the superior performance of the proposed distance function. In addition, an empirical runtime analysis demonstrates that the new approach is one of the best tree distance functions in terms of runtime efficiency.
