Improving NCD accuracy by combining document segmentation and document distortion

Compression distances have been applied to a broad range of domains because they are parameter-free, widely applicable and highly effective. However, they have a characteristic that becomes a drawback under particular circumstances: when they are used to compare two objects of very different sizes, they do not consider the objects similar even when they are related by a substring relationship. This work addresses that issue for the case in which compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Document segmentation is used to tackle the size-mismatch drawback, while document distortion is used to help compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither of them or applying either one alone. These results are consistent across datasets of diverse nature.
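The size-mismatch problem and the two techniques can be illustrated with a minimal Python sketch. It uses the standard definition of the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. Everything else is an assumption made for illustration and is not taken from the paper: the choice of zlib as the compressor, the fixed-size segmentation in `segments`, the minimum-over-segments rule in `segmented_ncd`, and the word-masking scheme in `distort` are all hypothetical stand-ins for the segmentation and distortion methods the abstract refers to.

```python
from __future__ import annotations

import re
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (compressor choice is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Standard NCD: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def segments(document: bytes, segment_size: int) -> list[bytes]:
    """Hypothetical segmentation: split the document into fixed-size chunks."""
    chunks = [document[i:i + segment_size]
              for i in range(0, len(document), segment_size)]
    return chunks or [document]  # keep one (empty) chunk for an empty document


def segmented_ncd(query: bytes, document: bytes, segment_size: int) -> float:
    """Hypothetical combination rule: NCD against the closest segment."""
    return min(ncd(query, seg) for seg in segments(document, segment_size))


def distort(text: str, words_to_mask: set[str]) -> str:
    """Hypothetical distortion: replace selected words with asterisks of equal length."""
    def mask(match: re.Match) -> str:
        word = match.group(0)
        return "*" * len(word) if word.lower() in words_to_mask else word
    return re.sub(r"[A-Za-z]+", mask, text)
```

Under these assumptions, a short document that appears inside a much longer one would be compared against segments of comparable size rather than against the whole long document, so C(xy) is no longer dominated by the unrelated remainder. Distortion would be applied to both documents before compression, for example `distort(text, {"the", "of"})`, and the masked text then encoded to bytes and passed to `segmented_ncd`.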
