Local Rank Distance

Researchers have developed a wide variety of methods for string data that can be applied successfully in fields such as computational biology and natural language processing. These methods range from clustering techniques used to analyze the phylogenetic trees of different organisms to kernel methods used to identify authorship or native language from text. The results of such methods are not perfect and leave room for improvement. Some of these methods are based on a distance or similarity measure for strings, such as the Hamming distance, Levenshtein distance, Kendall tau distance, rank distance, or string kernels. This paper introduces a new distance measure, termed Local Rank Distance (LRD), inspired by the recently introduced Local Patch Dissimilarity for images. Designed to conform to more general principles and adapted to DNA strings, LRD aims to improve over state-of-the-art methods for phylogenetic analysis. The paper presents two applications of LRD. The first is the phylogenetic analysis of mammals: experiments show that the phylogenetic trees produced by LRD are better than, or at least similar to, those reported in the literature. The second is identifying the native language of English learners. Because it works at the character level, the proposed method is completely language independent and theory neutral. In conclusion, LRD can be used as a general approach to measuring string similarity, despite having been designed for DNA.
