Paper information - Capturing Expression Using Linguistic Information

Capturing Expression Using Linguistic Information

Recognizing similarities between literary works for copyright infringement detection requires evaluating similarity in the expression of content. Copyright law protects the expression of content; similarity in content alone is not enough to indicate infringement. Expression refers to the way people convey particular information; it captures both the information and the manner of its presentation. In this paper, we present a novel set of linguistically informed features that provide a computational definition of expression and that enable accurate recognition of individual titles and their paraphrases more than 80% of the time. In comparison, baseline features such as tf-idf-weighted keywords and function words give an accuracy of at most 53%. Our computational definition of expression uses linguistic features that are extracted from POS-tagged text using context-free grammars, without incurring the computational cost of full parsers. The results indicate that informative linguistic features need not be prohibitively expensive to extract.
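To make the idea of extracting features from POS-tagged text without a full parser concrete, here is a minimal sketch. It is not the feature set or the grammar from the paper: it uses regular expressions over tag sequences as a simpler stand-in for the context-free grammars the abstract mentions. The pattern names, the `BE` symbol, and the function names are all hypothetical, and the input is assumed to be already tagged with Penn Treebank tags.

```python
import re

# Assumption: forms of "be" are collapsed into one symbol so that a
# passive-like pattern can be written over tags alone.
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

# Hypothetical shallow patterns over Penn Treebank tags (illustrative only).
# (?<!\S) anchors each match at the start of a tag symbol.
PATTERNS = {
    # "be" + optional adverbs + past participle, e.g. "was quickly taken"
    "passive": re.compile(r"(?<!\S)BE (?:RB )*VBN "),
    # optional determiner + adjectives + one or more nouns
    "noun_phrase": re.compile(r"(?<!\S)(?:DT )?(?:JJ )*(?:NN\w* )+"),
    # preposition followed by a simple noun phrase
    "prep_phrase": re.compile(r"(?<!\S)IN (?:DT )?(?:JJ )*(?:NN\w* )+"),
}


def encode(tagged):
    """Turn [(word, tag), ...] into a string of symbols, each followed by a space."""
    symbols = ["BE" if word.lower() in BE_FORMS else tag for word, tag in tagged]
    return " ".join(symbols) + " "


def count_features(tagged):
    """Count non-overlapping matches of each shallow pattern in one sentence."""
    symbols = encode(tagged)
    return {name: len(p.findall(symbols)) for name, p in PATTERNS.items()}


# Hand-tagged example sentence (the tags are supplied, not computed here).
sentence = [
    ("The", "DT"), ("old", "JJ"), ("book", "NN"), ("was", "VBD"),
    ("quickly", "RB"), ("taken", "VBN"), ("by", "IN"), ("the", "DT"),
    ("student", "NN"), (".", "."),
]
print(count_features(sentence))
# {'passive': 1, 'noun_phrase': 2, 'prep_phrase': 1}
```

Counts like these could be collected per passage and used as a feature vector for a classifier. Each pattern is a single linear scan over the tag string, which illustrates why shallow extraction of this kind is much cheaper than full parsing.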

Boris Katz | Özlem Uzuner
