Paper information - Text reuse detection by keyword extraction for Telegram channels

Text reuse detection is the task of finding similar texts, with applications such as plagiarism detection and the analysis of information diffusion. The usual approach combines text-similarity detection with other features, such as timestamps, which establish the order of publication and can, for example, identify the first publisher. This article proposes a method for finding similar texts based on keyword extraction. Like locality-sensitive hashing (LSH), it runs in linear time; in addition, it supports dynamic inputs and does not depend on the dimensionality of the text vectors. Our evaluations show that it performs better in both clustering quality measures and run time.
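The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each text is reduced to a small set of keywords (here simply the most frequent non-stopword tokens, standing in for a real keyword extractor) and that texts sharing the same keyword set fall into one bucket. The names `STOPWORDS`, `keyword_signature`, and `ReuseIndex`, the parameter `k`, and the sample posts are all hypothetical.

```python
import heapq
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; a real system would use a language-specific one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "for", "by"}


def keyword_signature(text, k=3):
    """Return the k most frequent non-stopword tokens as a hashable set.

    Raw term frequency stands in for a real keyword extractor (an assumption).
    """
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOPWORDS]
    counts = Counter(tokens)
    # Highest count first; ties broken alphabetically so the result is deterministic.
    top = heapq.nsmallest(k, counts, key=lambda t: (-counts[t], t))
    return frozenset(top)


class ReuseIndex:
    """Buckets posts by keyword signature; posts can be added one at a time."""

    def __init__(self, k=3):
        self.k = k
        self.buckets = defaultdict(list)

    def add(self, post_id, text, timestamp):
        """Index one post and return its signature (one dictionary update per post)."""
        sig = keyword_signature(text, self.k)
        self.buckets[sig].append((timestamp, post_id))
        return sig

    def clusters(self):
        """Return buckets holding more than one post, each sorted by timestamp."""
        return [sorted(posts) for posts in self.buckets.values() if len(posts) > 1]

    def first_publisher(self, sig):
        """Return the id of the earliest post in the bucket for this signature."""
        return min(self.buckets[sig])[1]


index = ReuseIndex(k=3)
sig_a = index.add("chanA/1", "Earthquake hits Tehran, earthquake damage reported in Tehran", 100)
sig_b = index.add("chanB/7", "Tehran earthquake: damage reported. Earthquake in Tehran!", 90)
sig_c = index.add("chanC/3", "Football final tonight, football fans gather", 95)
earliest = index.first_publisher(sig_a)
```

Because every post is tokenized once and stored with a single dictionary update, the work grows with the total amount of text, and new posts can be indexed as they arrive without rebuilding anything. Exact signature matching is brittle, though: a practical system would likely need a softer match, such as a minimum keyword overlap.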
