A Novel Similarity Measure for Sequence Data

Abstract: A variety of metrics have been introduced to measure the similarity of two given sequences. These metrics are widely used in applications ranging from spell correctors and categorizers to new sequence mining applications. Different metrics consider different aspects of sequences, but the essence of any sequence lies in the ordering of its elements. In this paper, we propose a novel sequence similarity measure that is based on all ordered pairs of one sequence and on a Hasse diagram built from the other sequence. In contrast with existing approaches, the idea behind the proposed metric is to extract all ordering features in order to capture sequence properties. We designed a clustering task to evaluate the metric. In our experiments, the proposed metric achieved higher clustering purity than metrics such as d2, Smith-Waterman, Levenshtein, and Needleman-Wunsch. We attribute the limitations of those methods to sequence features that they neglect and that the proposed metric takes into account.
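To make the two ingredients of the abstract concrete, the following is a minimal sketch and not the measure proposed in the paper: the abstract does not specify how the Hasse diagram is constructed or how the final score is computed, so the Hasse-diagram step is replaced here by a plain Jaccard overlap of ordered-pair sets. The function names `ordered_pairs`, `pair_overlap_similarity`, and `purity` are hypothetical and chosen only for illustration. The `purity` function uses the standard definition: the fraction of items that carry the majority true label of their cluster.

```python
from collections import Counter
from itertools import combinations


def ordered_pairs(seq):
    """Return the set of pairs (a, b) where a occurs before b in seq."""
    # combinations() preserves positional order, so each pair is (earlier, later).
    return set(combinations(seq, 2))


def pair_overlap_similarity(s, t):
    """Jaccard overlap of ordered-pair sets.

    Assumption: this is a simplified stand-in for illustration only;
    it does not build the Hasse diagram mentioned in the abstract.
    """
    p, q = ordered_pairs(s), ordered_pairs(t)
    if not p and not q:
        return 1.0  # two sequences with no pairs are treated as identical
    return len(p & q) / len(p | q)


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority true class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, Counter())[truth] += 1
    majority_total = sum(max(counts.values()) for counts in clusters.values())
    return majority_total / len(labels_true)


if __name__ == "__main__":
    print(pair_overlap_similarity("abc", "acb"))        # 0.5
    print(pair_overlap_similarity("abc", "cba"))        # 0.0
    print(purity(["x", "x", "y", "y"], [0, 0, 0, 1]))   # 0.75
```

In this sketch, "abc" and "acb" share two of four distinct ordered pairs, while a full reversal shares none, which illustrates why ordering-based features separate sequences that have identical element counts.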
