Soft Bigram distance for names matching

Background: Bigram distance (BI-DIST) is a recent approach to measuring the distance between two strings, a task that plays an important role in a wide range of applications. BI-DIST is valued for its representational and computational efficiency, which has led to extensive research to enhance it further. However, developing an algorithm that measures string distance both accurately and efficiently remains a major challenge. This research therefore aims to design an algorithm that matches names accurately. BI-DIST is considered the best orthographic measure for name identification; nevertheless, it lacks a distance scale between name bigrams.

Methods: This research proposes the Soft Bigram Distance (Soft-Bidist) measure, an extension of BI-DIST that softens the scale of comparison among name bigrams in order to improve name matching. Several datasets are used to demonstrate the efficiency of the proposed method.

Results: The results show that Soft-Bidist outperforms the compared algorithms on the different name matching datasets.
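The idea of softening the comparison between bigrams can be illustrated with a short sketch. It is not the measure as defined by the authors: the abstract does not give the Soft-Bidist weights, so the graded cost below (half a unit per mismatched character position), the `#` padding symbol, the normalization by the longer bigram sequence, and all function names are assumptions chosen for illustration only. The sketch runs an edit-distance style alignment over bigrams and contrasts an all-or-nothing bigram cost with a graded one.

```python
def bigrams(s, pad="#"):
    """Split s into overlapping character bigrams, with one pad symbol prefixed.

    The pad symbol is an assumption; it makes the first letter part of a bigram.
    """
    s = pad + s
    return [s[i:i + 2] for i in range(len(s) - 1)]


def binary_cost(p, q):
    """All-or-nothing comparison: 0 if the bigrams are identical, else 1."""
    return 0.0 if p == q else 1.0


def graded_cost(p, q):
    """Hypothetical soft comparison: 0.5 per mismatched character position."""
    return sum(c1 != c2 for c1, c2 in zip(p, q)) / 2.0


def bigram_distance(a, b, cost=binary_cost):
    """Edit-distance style alignment over bigrams, normalized to [0, 1].

    Insertions and deletions cost 1; substitutions cost cost(p, q).
    """
    x, y = bigrams(a), bigrams(b)
    m, n = len(x), len(y)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                          # delete a bigram of a
                d[i][j - 1] + 1.0,                          # insert a bigram of b
                d[i - 1][j - 1] + cost(x[i - 1], y[j - 1]),  # substitute
            )
    return d[m][n] / max(m, n, 1)


def soft_bigram_distance(a, b):
    """Same alignment, but with the graded (soft) bigram cost."""
    return bigram_distance(a, b, cost=graded_cost)
```

With the graded cost, a pair such as "smith" and "smyth" is scored as closer than under the all-or-nothing cost, because bigrams that share one character are no longer treated as completely different. Any real use would replace `graded_cost` with the weights defined in the paper.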
