Framework for syntactic string similarity measures

Abstract: A similarity measure is an essential component of information retrieval, document clustering, text summarization, and question answering, among others. In this paper, we introduce a general framework of syntactic similarity measures for matching short text. We thoroughly analyze the measures by dividing them into three components: character-level similarity, string segmentation, and matching technique. Soft variants of the measures are also introduced. With the help of two existing toolkits (SecondString and SimMetric), we provide an open-source Java toolkit of the proposed framework, which integrates the individual components so that completely new combinations can be created. Experimental results reveal that the performance of the similarity measures depends on the type of dataset. For well-maintained datasets, using a token-level measure is important, but the basic (crisp) variant is usually enough. For uncontrolled datasets, where typing errors are expected, the soft variants of the token-level measures are necessary. Among all tested measures, a soft token-level measure that combines set matching with q-grams at the character level performs best. A gap between human perception and syntactic measures remains because the measures lack semantic analysis.
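As a rough illustration of how the three components can be combined, the sketch below pairs whitespace segmentation with a Dice coefficient over character bigrams and a greedy, Jaccard-style set matching. It is written in Python for brevity and is not the API of the Java toolkit; the function names, the greedy matching strategy, the Dice coefficient, and the 0.5 threshold are all assumptions made for this example, not definitions taken from the paper.

```python
def qgrams(s, q=2):
    """Character-level component: set of q-grams of s (case-insensitive)."""
    s = s.lower()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_sim(a, b, q=2):
    """Dice coefficient over the q-gram sets of two tokens, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def soft_jaccard(s1, s2, threshold=0.5, q=2):
    """Token-level set matching with soft (approximate) token equality.

    Assumed scheme: each token of s1 is greedily paired with its most
    similar unused token of s2; the pair counts only if its q-gram
    similarity reaches the threshold. With threshold=1.0 only exact
    token matches count, which gives the crisp variant.
    """
    t1, t2 = s1.split(), s2.split()  # segmentation: whitespace tokens
    if not t1 and not t2:
        return 1.0
    unmatched = list(t2)
    score = 0.0
    for tok in t1:
        best = max(unmatched, key=lambda u: qgram_sim(tok, u, q), default=None)
        if best is None:
            break  # no tokens left to pair with
        sim = qgram_sim(tok, best, q)
        if sim >= threshold:
            score += sim
            unmatched.remove(best)
    # Jaccard-style normalization: soft intersection over soft union.
    return score / (len(t1) + len(t2) - score)


if __name__ == "__main__":
    clean, typo = "similarity measure", "similarty measure"
    print("soft :", soft_jaccard(clean, typo))
    print("crisp:", soft_jaccard(clean, typo, threshold=1.0))
```

With a misspelled token, the crisp setting treats the token as a complete mismatch, while the soft setting still gives it partial credit. This is the behavior the abstract associates with uncontrolled datasets.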
