Identifying expression fingerprints using linguistic information

This thesis presents a technology to complement taxation-based policy proposals aimed at addressing the digital copyright problem. The approach facilitates identification of intellectual property using expression fingerprints. Copyright law protects the expression of content, so recognizing literary works for copyright protection requires identifying how their content is expressed. The expression fingerprints described in this thesis use a novel set of linguistic features that capture both the content presented in documents and the manner of expression used in conveying this content. These fingerprints consist of both syntactic and semantic elements of language. Examples of the syntactic elements of expression include structures of embedding and embedded verb phrases. The semantic elements of expression consist of high-level, broad semantic categories. Together, the syntactic and semantic elements of expression enable generation of models that correctly identify books and their paraphrases 82% of the time, a significant (approximately 18%) improvement over models that use tf-idf-weighted keywords. Models built with these features also outperform models built with standard stylometric features (e.g., function words), which yield an accuracy of 62%.
In the non-digital world, copyright holders collect revenues by controlling distribution of their works. Current approaches to the digital copyright problem attempt to give copyright holders the same kind of control over distribution by employing Digital Rights Management (DRM) systems. However, DRM systems also enable copyright holders to control and limit fair use, to inhibit others' speech, and to collect private information about individual users of digital works.
Digital tracking technologies enable alternative solutions to the digital copyright problem; some of these solutions can protect the creative incentives of copyright holders in the absence of control over distribution of works. Expression fingerprints facilitate digital tracking even when literary works are DRM- and watermark-free, and even when they are paraphrased. As such, they enable metering the popularity of works. This makes practicable those solutions that encourage large-scale dissemination and unrestricted use of digital works while protecting the revenues of copyright holders without imposing limits on distribution, for example through taxation-based revenue collection and distribution systems. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
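As a rough illustration of the keyword baseline the abstract compares against, the following sketch weights terms by tf-idf and compares documents by cosine similarity. It is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not specify which tf-idf variant was used, so the weighting here (relative term frequency times the log of inverse document frequency) is one common choice, and the function names and the three toy sentences are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: a sentence, a paraphrase of it, and an unrelated sentence.
docs = [
    "the whale hunted the captain across the sea".split(),
    "the captain was hunted by the whale over the sea".split(),
    "the lawyer filed a copyright claim in court".split(),
]
vecs = tfidf_vectors(docs)
sim_paraphrase = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(f"paraphrase: {sim_paraphrase:.3f}  unrelated: {sim_unrelated:.3f}")
```

A keyword representation like this rewards shared vocabulary and ignores word order and sentence structure, which is the kind of information the syntactic elements of expression described above are meant to capture.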

[52]  R. H. Baayen,et al.  The CELEX Lexical Database (CD-ROM) , 1996 .

[53]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[54]  Robert Tappan Morris,et al.  Tarzan: a peer-to-peer anonymizing network layer , 2002, CCS '02.

[55]  A. P. B. Sardinha Corpus linguistics - investigating language structure and use , 1999 .

[56]  Pamela Samuelson Digital Rights Management {and, or, vs.} the Law , 2003 .

[57]  Urs Gasser,et al.  iTunes: How Copyright, Contract, and Technology Shape the Business of Digital Media - A Case Study , 2004 .

[58]  Boris Katz,et al.  Exploiting Lexical Regularities in Designing Natural Language Systems , 1988, COLING.

[59]  Julie E. Cohen Overcoming Property: Does Copyright Trump Privacy? , 2002 .

[60]  Kathleen R. McKeown,et al.  SIMFINDER: A Flexible Clustering Tool for Summarization , 2001 .

[61]  Julie E. Cohen Some Reflections on Copyright Management Systems and Laws Designed to Protect Them , 1997 .

[62]  Michael Halliday,et al.  Cohesion in English , 1976 .

[63]  Joe Lee Davis,et al.  Criticism and Parody , 1951 .

[64]  Jessica Litman Digital Copyright , 2017 .

[65]  Vasa D. Mihailovich Geir Kjetsaa, Sven Gustavsson, Bengt Beckman, and Steinar Gil, The Authorship of The Quiet Don , 1985 .

[66]  Ingemar J. Cox,et al.  Digital Watermarking , 2003, Lecture Notes in Computer Science.

[67]  John Charles Baker,et al.  Pace: A Test of Authorship Based on the Rate at which New Words Enter an Author's Text , 1988 .

[68]  Julie E. Cohen,et al.  Fair Use Infrastructure for Copyright Management Systems , 2000 .

[69]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[70]  Yorick Wilks,et al.  A tractable machine dictionary as a resource for computational semantics , 1989 .

[71]  Gerard Salton,et al.  Automatic text decomposition using text segments and text themes , 1996, HYPERTEXT '96.

[72]  Robert E. Schapire,et al.  A Brief Introduction to Boosting , 1999, IJCAI.

[73]  A. Q. Morton The Authorship of Greek Prose , 1965 .

[74]  Kathleen Fischer Promises to Keep , 1992 .

[75]  Stephen Mooney,et al.  Digital Rights Management: Business and Technology , 2001 .

[76]  Vitaly Shmatikov,et al.  Information Hiding, Anonymity and Privacy: a Modular Approach , 2004, J. Comput. Secur..

[77]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[78]  김두식,et al.  English Verb Classes and Alternations , 2006 .

[79]  David Chaum,et al.  Electronic Mail, Return Address, and Digital Pseudonyms , 1981 .

[80]  Bernhard Plattner,et al.  Introducing MorphMix: peer-to-peer based anonymous Internet usage with collusion detection , 2002, WPES '02.

[81]  Chris Buckley,et al.  New Retrieval Approaches Using SMART: TREC 4 , 1995, TREC.

[82]  Paul F. Syverson,et al.  Hiding Routing Information , 1996, Information Hiding.

[83]  Eva I. Ejerhed,et al.  Finding Clauses in Unrestricted Text by Finitary and Stochastic Methods , 1988, ANLP.

[84]  Michael Collins,et al.  Head-Driven Statistical Models for Natural Language Parsing , 2003, CL.

[85]  Paul England,et al.  The Darknet and the Future of Content Protection , 2002, Digital Rights Management Workshop.

[86]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[87]  Graeme Hirst,et al.  Detecting Stylistic Inconsistencies in Collaborative Writing , 1996, The New Writing Environment.

[88]  Ido Dagan,et al.  A Corpus-Independent Feature Set for Style-Based Text Categorization , 2003 .

[89]  C. B. Williams Mendenhall's studies of word-length distribution in the works of Shakespeare and Bacon , 1975 .

[90]  Stefan Bechtold The Present and Future of Digital Rights Management , 2006, 2006 Second International Conference on Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS'06).

[91]  G. Yule ON SENTENCE- LENGTH AS A STATISTICAL CHARACTERISTIC OF STYLE IN PROSE: WITH APPLICATION TO TWO CASES OF DISPUTED AUTHORSHIP , 1939 .

[92]  Beth Levin,et al.  English Verb Classes and Alternations: A Preliminary Investigation , 1993 .

[93]  Hector Garcia-Molina,et al.  SCAM: A Copy Detection Mechanism for Digital Documents , 1995, DL.

[94]  Min-Yen Kan,et al.  Role of Verbs in Document Analysis , 1998, ACL.

[95]  Boris Katz,et al.  Using empirical methods for evaluating expression and content similarity , 2004, 37th Annual Hawaii International Conference on System Sciences, 2004. Proceedings of the.

[96]  Fabien A. P. Petitcolas,et al.  Digital Watermarking , 2003, Lecture Notes in Computer Science.

[97]  H. S. Sichel,et al.  On a Distribution Representing Sentence‐Length in Written Prose , 1974 .

[98]  Nils J. Nilsson,et al.  Artificial Intelligence , 1974, IFIP Congress.

[99]  Bruce A. Lehman Intellectual Property and the National Information Infrastructure: The Report of the Working Group on Intellectual Property Rights , 1995 .

[100]  Laura Hidalgo-Downing,et al.  Negation in discourse: a text world approach to Joseph Heller’s Catch-22 , 2000 .

[101]  Lawrence Lessig,et al.  Code and Other Laws of Cyberspace , 1999 .

[102]  Uma Suthersanen,et al.  Copyright and free speech : comparative and international analyses , 2005 .

[103]  Claude S. Brinegar,et al.  Mark Twain and the Quintus Curtius Snodgrass Letters: A Statistical Test of Authorship , 1963 .

[104]  T C Mendenhall,et al.  THE CHARACTERISTIC CURVES OF COMPOSITION. , 1887, Science.

[105]  Boris Katz,et al.  Using English for Indexing and Retrieving , 1991 .

[106]  B. Efron,et al.  Did Shakespeare write a newly-discovered poem? , 1987 .

[107]  Dennis Kügler,et al.  An Analysis of GNUnet and the Implications for Anonymous, Censorship-Resistant Networks , 2003, Privacy Enhancing Technologies.