Practical Linguistic Steganography using Contextual Synonym Substitution and a Novel Vertex Coding Method

Linguistic steganography is concerned with hiding information in natural language text. One of the major transformations used in linguistic steganography is synonym substitution. However, few existing studies have examined the practical application of this approach. In this article we propose two improvements to the use of synonym substitution for encoding hidden bits of information. First, we use the Google n-gram corpus to check whether a synonym is applicable in context, and we evaluate this method using data from the SemEval lexical substitution task and human-annotated data. Second, we address the problem that arises from words with more than one sense: such a word can belong to several synonym sets, which creates a potential ambiguity as to which bits it represents. We develop a novel method in which words are the vertices in a graph, synonyms are linked by edges, and the bits assigned to a word are determined by a vertex coding algorithm. This method ensures that each word represents a unique sequence of bits without discarding large numbers of synonyms, and thus maintains a reasonable embedding capacity.
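
The graph formulation can be illustrated with a small sketch. It is not the coding algorithm of the article, whose details are not given here; it swaps in a plain greedy vertex colouring as a stand-in, and the function names and the toy synonym sets are assumptions made for illustration. The sketch only shows the property the abstract describes: every word gets a single bit string, and no two words in the same synonym set share one.

```python
from itertools import combinations


def build_synonym_graph(synsets):
    """Words are vertices; two words are linked if they share a synonym set."""
    graph = {}
    for synset in synsets:
        for word in synset:
            graph.setdefault(word, set())
        for a, b in combinations(synset, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_vertex_codes(graph):
    """Assign each word the smallest colour unused by its neighbours,
    then write the colour as a fixed-width bit string.

    Illustrative stand-in only: a generic greedy colouring, not the
    vertex coding method proposed in the article.
    """
    colours = {}
    # Visit high-degree words first; ties are broken alphabetically so the
    # sender and receiver derive identical codes from the same synonym sets.
    for word in sorted(graph, key=lambda w: (-len(graph[w]), w)):
        used = {colours[n] for n in graph[word] if n in colours}
        colour = 0
        while colour in used:
            colour += 1
        colours[word] = colour
    width = max(1, max(colours.values(), default=0).bit_length())
    return {w: format(c, f"0{width}b") for w, c in colours.items()}


# Hypothetical toy data: "splice" has two senses and so sits in two sets.
synsets = [["marry", "wed", "splice"], ["join", "splice", "conjoin"]]
codes = greedy_vertex_codes(build_synonym_graph(synsets))
for synset in synsets:
    print({word: codes[word] for word in synset})
```

Because "splice" carries the same code in both of its synonym sets, a receiver who rebuilds the same graph can read its bits without knowing which sense the sender intended. A greedy colouring may use more colours than necessary, so a practical scheme would need to bound the code length, which this sketch does not attempt.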
