The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions

Information hiding in natural language text has mainly consisted of applying approximately meaning-preserving modifications to a given cover text until it encodes the intended mark, and synonym substitution has been a major technique for doing so. In previous schemes, substitutions were carried out only until the text "confessed", i.e., carried the intended mark message. We propose here a better way to use synonym substitution, one that is no longer guided entirely by the mark-insertion process: it is also guided by a resilience requirement, subject to a maximum-allowed-distortion constraint. Previous schemes for information hiding in natural language text did not numerically quantify the distortion introduced by their transformations; they mainly used heuristic quality measures based on conformity to a language model, not on reference to the original cover text. When a word admits many alternative substitutions, we prioritize those alternatives according to a quantitative resilience criterion and use them in that order; in a nutshell, we favor the more ambiguous alternatives. In fact, we not only attempt to achieve maximum ambiguity but simultaneously aim to get as close as possible to the above-mentioned distortion limit, since that prevents the adversary from making further transformations without exceeding the damage threshold. That is, we continue to modify the document even after the text has "confessed" to the mark, for the dual purpose of maximizing ambiguity and deliberately approaching the distortion limit. The quantification we use makes it possible to apply the existing information-theoretic framework to the natural language domain, which poses unique challenges not present in the image or audio domains. The resilience stems from two facts: (i) the adversary does not know where the changes were made, and (ii) automated disambiguation is a major difficulty for any natural language processing system (what is bad news for the natural language processing area is good news for our scheme's resilience). Beyond this design and analysis, another contribution of this paper is a description of the scheme's implementation and of the experimental data obtained.
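To make the selection strategy concrete, the following is a minimal sketch of ambiguity-ranked synonym substitution under a distortion budget, assuming Python with NLTK's WordNet interface. The sense-count ambiguity proxy, the `distortion_cost` measure, and the function names are illustrative assumptions, not the paper's actual quantification.

```python
# A minimal sketch of ambiguity-ranked synonym selection under a distortion
# budget, assuming NLTK's WordNet corpus is installed (nltk.download('wordnet')).
# distortion_cost() is a hypothetical stand-in for the paper's numeric measure.
from nltk.corpus import wordnet as wn

def synonym_candidates(word, pos=wn.NOUN):
    """Collect the distinct WordNet synonyms of `word` for the given POS."""
    candidates = set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            name = lemma.name().replace('_', ' ')
            if name.lower() != word.lower():
                candidates.add(name)
    return candidates

def ambiguity(word, pos=wn.NOUN):
    """Resilience proxy: the number of senses a word has. A more ambiguous
    substitute is harder for an adversary to disambiguate and undo."""
    return len(wn.synsets(word, pos=pos))

def distortion_cost(original, substitute, pos=wn.NOUN):
    """Hypothetical distortion measure: one minus the best WordNet path
    similarity between any sense of the original and of the substitute."""
    best = 0.0
    for s1 in wn.synsets(original, pos=pos):
        for s2 in wn.synsets(substitute, pos=pos):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return 1.0 - best

def pick_substitute(word, budget, pos=wn.NOUN):
    """Return the most ambiguous synonym whose distortion fits the budget,
    along with the budget remaining after the substitution."""
    ranked = sorted(synonym_candidates(word, pos),
                    key=lambda w: ambiguity(w, pos), reverse=True)
    for candidate in ranked:
        cost = distortion_cost(word, candidate, pos)
        if cost <= budget:
            return candidate, budget - cost
    return word, budget  # no admissible substitute: leave the word unchanged

# Example: choose a substitute for "car" with a distortion budget of 0.5.
substitute, remaining = pick_substitute('car', 0.5)
print(substitute, remaining)
```

The sketch shows only the per-word selection step; in the scheme described above, embedding also continues after the mark is fully encoded, spending the remaining budget on further ambiguity-increasing substitutions so that the marked text ends up as close as possible to the distortion limit.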
