Use of Kolmogorov distance identification of web page authorship, topic and domain

Recently there has been an upsurge of interest in using information entropy measures to identify similarities and differences between strings. Such strings include natural-language text documents, computer programs and biological sequences. This work applies the technique to author identification in online postings and to the identification of web pages that are related to each other. The approach appears to offer benefits in the analysis of web documents without the need for domain-specific parsing or document modeling.
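
As an illustration of the general idea, the sketch below approximates an information distance between two strings with an off-the-shelf compressor. It uses the normalized compression distance, a widely used compression-based stand-in for Kolmogorov complexity. The abstract does not say which formula or compressor the authors use, so the choice of `zlib`, the function names `compressed_size`, `ncd` and `closest`, and the sample strings are all assumptions made for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed input, a rough proxy for its information content."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # shared structure makes the concatenation compress well
    return (cxy - min(cx, cy)) / max(cx, cy)


def closest(query: bytes, candidates: dict) -> str:
    """Return the label of the candidate text with the smallest distance to the query."""
    return min(candidates, key=lambda label: ncd(query, candidates[label]))


# Hypothetical sample data: a repetitive English phrase and an unrelated byte pattern.
fox = b"the quick brown fox jumps over the lazy dog " * 20
pattern = bytes(range(256)) * 4

print(ncd(fox, fox))      # small: the second copy adds almost nothing new
print(ncd(fox, pattern))  # much larger: the two inputs share little structure
print(closest(fox[:400], {"fox": fox, "pattern": pattern}))
```

No parsing or tokenization is involved: the inputs are treated as raw bytes, which is why a measure of this kind can be applied to postings or web pages without a domain-specific document model. On short inputs the compressor's fixed overhead distorts the values, so distances are best compared relative to one another and not read as absolute quantities.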
