Paper information - Words are not enough: sentence level natural language watermarking

Compared to other media, natural language text presents unique challenges for information hiding. These challenges require a robust algorithm that works under the following constraints: (i) low embedding bandwidth, i.e., the number of sentences is comparable to the message length; (ii) not all transformations can be applied to a given sentence; and (iii) the number of alternative forms for a sentence is relatively small, a limitation governed by the grammar and vocabulary of the natural language, as well as by the requirement to preserve the style and fluency of the document. The adversary can carry out all the transformations used for embedding in order to remove the embedded message. In addition, the adversary can permute the sentences, select and use a subset of sentences, and insert new sentences. We give a scheme that overcomes these challenges, together with a partial implementation and its evaluation for the English language. The present application of this scheme works at the sentence level while also using a word-level watermarking technique that was recently designed and built into a fully automatic system ("Equimark"). Unlike Equimark, whose resilience relied on the introduction of ambiguities, the present paper's sentence-level technique is better suited to situations where very little change to the text is allowable (i.e., when style is important). Secondarily, this paper shows how to use lower-level (in this case word-level) marking to improve the resilience and embedding properties of higher-level (in this case sentence-level) schemes. We achieve this by using the word-based methods as a separate channel from the sentence-based methods, thereby improving on the results of either one alone. The sentence-level watermarking technique we introduce is novel and powerful, as it relies on multiple features of each sentence and exploits the notion of orthogonality between features.

Mikhail J. Atallah | Umut Topkara | Mercan Topkara
