New designs for improving the efficiency and resilience of natural language watermarking

Contributing our own creativity (in the form of text, image, audio, and video) to the pool of online information is fast becoming an essential part of the online experience. However, it remains an open question how we, as authors, can control the way that the information we create is distributed or reused. Rights management problems are especially serious for text, since it is particularly easy for other people to download and manipulate copyrighted text from the Internet and later reuse it free from control. There is a need for a rights protection system that "travels with the content". Digital watermarking is one such mechanism: it embeds the copyright information in the document itself. Besides traveling with the content of the documents, digital watermarks can also be imperceptible to the user, which makes removing them from the document challenging. The goal of this thesis is to design practical and resilient natural language watermarking systems. I have designed and implemented several natural language watermarking algorithms that use the linguistic features of the cover text to embed information. Using linguistic features provides resilience by making the message an elemental part of the content of the text, and by judiciously exploiting the ambiguity of natural language usage and the richness of features of natural language constituents. The proposed systems cover a variety of genres of text (short, long, edited, and cursory text), and I analyze their resilience and feasibility. Significant by-products of this research are as follows: a protocol for improving the stealthiness of information hiding systems; systems that use the proposed information hiding mechanisms to solve the problems of private communication and phishing defense; and an analysis of the evaluation methodologies and detection techniques for information hiding systems that use natural language text as cover.
