Measuring text similarity with dynamic time warping

In this work, we describe an approach that makes typed texts comparable using temporal data mining methods. The idea was proposed in earlier work [11], but to our knowledge no significant research on the subject has followed. The basic idea is to derive artificial time series from texts by counting the occurrences of relevant keywords in a sliding window moved across the text; the resulting time series can then be compared with techniques from time series analysis. Here we use the Dynamic Time Warping distance [3]. Through extensive testing we derived adequate parameters for computing the time series, and we show that the approach may aid in recognizing similar texts, since the observed distances between similar documents are significantly lower than those between unrelated texts. The approach may also be especially suitable for comparing texts written in different languages, since only the translations of the keywords must be known.
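The two steps described above (keyword counting in a sliding window, then a Dynamic Time Warping comparison) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in this work: the function names `keyword_series` and `dtw_distance`, the word-level tokenization, the default `window` and `step` values, and the absolute difference as local cost are all hypothetical choices made for the example.

```python
import re


def keyword_series(text, keywords, window=50, step=10):
    """Turn a text into a time series of keyword counts.

    Assumption: the window is measured in word tokens and moved by
    `step` tokens; both defaults are illustrative, not tuned values.
    """
    tokens = re.findall(r"\w+", text.lower())
    keys = {k.lower() for k in keywords}
    hits = [1 if t in keys else 0 for t in tokens]
    if len(hits) <= window:
        # Text shorter than one window: a single count for the whole text.
        return [sum(hits)]
    return [sum(hits[i:i + window])
            for i in range(0, len(hits) - window + 1, step)]


def dtw_distance(a, b):
    """Unconstrained DTW distance, O(len(a) * len(b)) time.

    Assumption: local cost is the absolute difference of two counts.
    Only two rows of the cost matrix are kept in memory.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of the three admissible predecessor cells.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]
```

With such a sketch, two documents would be compared by calling `dtw_distance(keyword_series(doc_a, keys), keyword_series(doc_b, keys))`. For a cross-language comparison, each document would be paired with the keyword list in its own language, so that both series count the same concepts.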

[1] Eamonn J. Keogh, et al. Finding Motifs in a Database of Shapes, 2007, SDM.

[2] Philip Chan, et al. Toward accurate dynamic time warping in linear time and space, 2007, Intell. Data Anal.

[3] Parvati Iyer, et al. Document Similarity Analysis for a Plagiarism Detection System, 2005, IICAI.

[4] Eamonn J. Keogh, et al. Everything you know about Dynamic Time Warping is Wrong, 2004.

[5] Eamonn J. Keogh, et al. Iterative Deepening Dynamic Time Warping for Time Series, 2002, SDM.

[6] Stan Salvador, et al. FastDTW: Toward Accurate Dynamic Time Warping in Linear Time and Space, 2004.

[7] Fintan Culwin, et al. Towards an error free plagarism detection process, 2001, ITiCSE.

[8] Yuen-Yan Chan, et al. A natural language processing approach to automatic plagiarism detection, 2007, SIGITE '07.

[9] R. Manmatha, et al. Word image matching using dynamic time warping, 2003, IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[10] Gerard Salton, et al. A vector space model for automatic indexing, 1975, CACM.

[11] Eamonn J. Keogh, et al. Scaling up dynamic time warping for datamining applications, 2000, KDD '00.

[12] Eamonn J. Keogh, et al. Derivative Dynamic Time Warping, 2001, SDM.

[13] Kyuseok Shim, et al. Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases, 1995, VLDB.

[14] Hidekazu Nakawatase, et al. Calculating similarity between texts using graph-based text representation model, 2004, CIKM '04.

[15] Mike Joy, et al. Sentence-based natural language plagiarism detection, 2004, JERC.

[16] Massimo Moneglia, et al. Plagiarism Detection through Multilevel Text Comparison, 2006, Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'06).

[17] S. Chiba, et al. Dynamic programming algorithm optimization for spoken word recognition, 1978.

[18] Donald J. Berndt, et al. Using Dynamic Time Warping to Find Patterns in Time Series, 1994, KDD Workshop.