The Secret's in the Word Order: Text-to-Text Generation for Linguistic Steganography

Linguistic steganography is a form of covert communication that uses natural language to conceal the existence of a hidden message, usually by making systematic changes to a cover text. This paper proposes a linguistic steganography method that uses word ordering as the linguistic transformation. We show that the word ordering technique can be used in conjunction with existing translation-based embedding algorithms. Because unnatural word orderings would arouse the suspicion of third parties and diminish the security of the hidden message, we develop a method that uses a maximum entropy classifier to determine the naturalness of sentence permutations. The classifier is evaluated against human judgements and compared with a baseline method that uses the Google n-gram corpus. The results show that the proposed system can achieve a satisfactory security level and embedding capacity for the linguistic steganography application.
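
To make the idea of embedding through word order concrete, the following is a minimal sketch and not the paper's implementation. It assumes a hash-based scheme loosely in the spirit of translation-based embedding: sender and receiver share a secret key, and the sender picks a natural-sounding permutation whose keyed hash ends in the secret bits. The names `keyed_bits`, `embed`, and `extract` are hypothetical, and `is_natural` is a placeholder standing in for a naturalness check such as the maximum entropy classifier.

```python
import hashlib
import hmac
from itertools import permutations
from typing import Callable, Optional, Sequence


def keyed_bits(sentence: str, key: bytes, n_bits: int) -> int:
    """Return the lowest n_bits of an HMAC-SHA256 digest of the sentence."""
    digest = hmac.new(key, sentence.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") & ((1 << n_bits) - 1)


def embed(
    words: Sequence[str],
    key: bytes,
    payload: int,
    n_bits: int,
    is_natural: Callable[[str], bool],
) -> Optional[str]:
    """Return a word ordering judged natural whose keyed hash encodes payload.

    Exhaustive enumeration is only feasible for short sentences; this is an
    illustration, not an efficient search. Returns None if no ordering fits.
    """
    for perm in sorted(set(permutations(words))):
        candidate = " ".join(perm)
        if is_natural(candidate) and keyed_bits(candidate, key, n_bits) == payload:
            return candidate
    return None


def extract(sentence: str, key: bytes, n_bits: int) -> int:
    """Recover the embedded bits; the receiver only needs the shared key."""
    return keyed_bits(sentence, key, n_bits)


if __name__ == "__main__":
    key = b"shared-secret"  # assumed pre-shared key
    words = ["we", "met", "in", "paris", "yesterday"]
    # Placeholder judge that accepts everything; a real system would reject
    # unnatural orderings here.
    stego = embed(words, key, payload=1, n_bits=1, is_natural=lambda s: True)
    print(stego, extract(stego, key, n_bits=1))
```

In this sketch the receiver never needs the original word order, only the key. A stricter `is_natural` leaves fewer candidate orderings and so lowers the chance that some ordering matches the payload, which illustrates the trade-off between security and embedding capacity mentioned above.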
