String similarity algorithms for a ticket classification system

Fuzzy string matching allows strings that match closely, but not exactly, to be compared and extracted from bodies of text. It is therefore useful in systems that automatically extract and process documents. We summarise and compare several existing algorithms for measuring string similarity: Longest Common Subsequence (LCS), Dice coefficient, cosine similarity, Levenshtein distance and Damerau distance. Using previously classified customer support enquiries (tickets), we evaluated how effectively different algorithms and configurations automatically identify keywords of interest (such as error phrases, product names and warning messages) when those key phrases are misspelled, copied incorrectly or otherwise formed differently. We select an optimal algorithm based on novel studies of these similarity measures applied to text strings tokenised into characters. The same analysis also identified an optimum similarity threshold for various categories of enquiry, reducing mismatched strings while retaining coverage of the correctly matched key phrases. This led to a 15% improvement in the ratio of false positive to true positive classifications over the existing approach used by a customer support system.
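As a rough illustration of character-level similarity with a threshold, the following Python sketch implements two of the measures named above (Levenshtein distance and the Dice coefficient) and a simple sliding-window matcher. It is not the system evaluated here: the function names, the normalisation by the longer string's length, the use of character bigrams for Dice, and the 0.8 threshold are all assumptions chosen for the example.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance over characters (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map distance to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient over character bigrams (bigram size is an assumption)."""
    grams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    grams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


def find_key_phrase(text, phrase, threshold=0.8, similarity=levenshtein_similarity):
    """Slide a window of len(phrase) characters over text.

    Returns (window, score) for the best window scoring at least
    `threshold`, or None. The default threshold is hypothetical.
    """
    text, phrase = text.lower(), phrase.lower()
    n = len(phrase)
    best = None
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        score = similarity(window, phrase)
        if score >= threshold and (best is None or score > best[1]):
            best = (window, score)
    return best


# A misspelled key phrase is still found; an unrelated ticket is not.
hit = find_key_phrase("got a conection timeout again", "connection timeout")
miss = find_key_phrase("printer is out of paper", "connection timeout")
```

Raising the threshold reduces mismatched strings at the cost of missing more distorted key phrases, which is the trade-off the threshold analysis above addresses.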