Capturing Expression Using Linguistic Information

Recognizing similarities between literary works for copyright infringement detection requires evaluating similarity in the expression of content. Copyright law protects the expression of content; similarities in content alone are not enough to indicate infringement. Expression refers to the way people convey particular information; it captures both the information and the manner of its presentation. In this paper, we present a novel set of linguistically informed features that provide a computational definition of expression and that enable accurate recognition of individual titles and their paraphrases more than 80% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 53%. Our computational definition of expression uses linguistic features that are extracted from POS-tagged text using context-free grammars, without incurring the computational cost of full parsers. The results indicate that informative linguistic features need not be prohibitively expensive to extract.
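To make the idea of extracting features from POS-tagged text without a full parser concrete, the following is a minimal sketch. It is not the feature set or the grammars used in this paper: the two rules (NP and PP), the Penn Treebank-style tags, and the names `match_np`, `match_pp`, and `phrase_counts` are assumptions introduced only for illustration. The sketch scans a tag sequence left to right and counts the phrases that the toy grammar recognizes.

```python
from collections import Counter

# Hypothetical toy grammar over Penn Treebank-style POS tags (an assumption;
# the paper's actual grammars and features are not reproduced here):
#   NP -> DT? JJ* NOUN+
#   PP -> IN NP
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def match_np(tags, i):
    """Return the index just past an NP starting at i, or None if none starts there."""
    j = i
    if j < len(tags) and tags[j] == "DT":      # optional determiner
        j += 1
    while j < len(tags) and tags[j] == "JJ":   # zero or more adjectives
        j += 1
    first_noun = j
    while j < len(tags) and tags[j] in NOUN_TAGS:  # one or more nouns
        j += 1
    return j if j > first_noun else None


def match_pp(tags, i):
    """Return the index just past a PP (IN followed by an NP) starting at i, or None."""
    if i < len(tags) and tags[i] == "IN":
        return match_np(tags, i + 1)
    return None


def phrase_counts(tagged):
    """Count NP and PP matches in a list of (word, tag) pairs, scanning left to right."""
    tags = [tag for _, tag in tagged]
    counts = Counter()
    i = 0
    while i < len(tags):
        end = match_pp(tags, i)
        if end is not None:
            counts["PP"] += 1
            i = end
            continue
        end = match_np(tags, i)
        if end is not None:
            counts["NP"] += 1
            i = end
            continue
        i += 1  # no phrase starts here; move on
    return counts


tagged_sentence = [
    ("The", "DT"), ("old", "JJ"), ("man", "NN"),
    ("walked", "VBD"),
    ("by", "IN"), ("the", "DT"), ("river", "NN"),
]
print(phrase_counts(tagged_sentence))
```

Counts like these could serve as entries in a feature vector for a text. Each rule only needs a single pass over the tag sequence, which is the sense in which such shallow matching is cheaper than full parsing.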
