Harry: A Tool for Measuring String Similarity

Comparing strings and assessing their similarity is a basic operation in many application domains of machine learning, such as information retrieval, natural language processing and bioinformatics. Practitioners can choose from a large variety of similarity measures for this task, each emphasizing different aspects of the string data. In this article, we present Harry, a small tool specifically designed for measuring the similarity of strings. Harry implements over 20 similarity measures, including common string distances and string kernels, such as the Levenshtein distance and the subsequence kernel. The tool has been designed with efficiency in mind and supports multi-threaded as well as distributed computing, enabling the analysis of large data sets of strings. Harry supports common data formats and can therefore interface with analysis environments such as Matlab, Pylab and Weka.
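To make one of the named measures concrete, the following is a minimal sketch of the Levenshtein distance and of a pairwise distance matrix over a small set of strings. It is not Harry's implementation or interface: the function names `levenshtein` and `pairwise_distances` are hypothetical, chosen here for illustration, and the code uses the textbook dynamic-programming formulation with unit costs for insertion, deletion and substitution.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs (illustrative, not Harry's code)."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def pairwise_distances(strings: Sequence[str]) -> List[List[int]]:
    """Symmetric matrix of Levenshtein distances (hypothetical helper)."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        # The distance is symmetric, so only the upper triangle is computed.
        for j in range(i + 1, n):
            d = levenshtein(strings[i], strings[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    for row in pairwise_distances(["kitten", "sitting", "mitten"]):
        print(row)
```

Because every entry of the matrix is independent of the others, this kind of computation parallelizes naturally across threads or machines, which is the property the abstract refers to when it mentions multi-threaded and distributed computing.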
