TStego-THU: Large-Scale Text Steganalysis Dataset

In recent years, advances in natural language processing (NLP) have driven rapid progress in linguistic steganography. However, to the best of our knowledge, there is currently no public dataset for text steganalysis, which makes it difficult to compare linguistic steganalysis methods fairly. In this paper, we therefore construct and release a large-scale linguistic steganalysis dataset called TStego-THU, which we hope will provide a fair platform for comparing linguistic steganalysis algorithms. TStego-THU covers two modes of text steganography, text modification-based and text generation-based, each represented by two recent or classical text steganography algorithms. All texts in TStego-THU come from three types of text media commonly transmitted in cyberspace: news, Twitter, and commentary text. In total, TStego-THU contains 240,000 sentences (120,000 cover-stego text pairs). Each steganographic sentence is generated by randomly choosing one of the four steganographic algorithms and embedding a random bitstream into a randomly extracted normal text. We also evaluate several recent text steganalysis algorithms on TStego-THU as benchmarks; detailed results can be found in the experiment section. We hope that TStego-THU will further promote the development of universal text steganalysis technology. The description of TStego-THU and instructions will be released here: https://github.com/YangzlTHU/Linguistic-Steganography-and-Steganalysis.
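The pairing procedure described above can be illustrated with a minimal Python sketch. It is not the authors' construction code: `toy_embedder`, `build_pairs`, the payload length `n_bits`, and the fixed seed are hypothetical names and values chosen for illustration, and the toy embedders merely append the bits to the cover sentence in place of real modification-based or generation-based algorithms.

```python
import random
from typing import Callable, Dict, List, Tuple

# An embedder maps (cover sentence, secret bits) to a stego sentence.
Embedder = Callable[[str, List[int]], str]


def toy_embedder(tag: str) -> Embedder:
    """Hypothetical stand-in for a real steganography algorithm."""
    def embed(cover: str, bits: List[int]) -> str:
        # A real algorithm would hide the bits in word choices;
        # this placeholder appends them visibly for illustration only.
        return f"{cover} <{tag}:{''.join(map(str, bits))}>"
    return embed


def build_pairs(
    corpus: List[str],
    embedders: Dict[str, Embedder],
    n_pairs: int,
    n_bits: int = 8,   # assumed payload length, not taken from the paper
    seed: int = 0,     # fixed seed so the sketch is reproducible
) -> List[Tuple[str, str, str]]:
    """Return (cover, stego, algorithm name) triples."""
    rng = random.Random(seed)
    covers = rng.sample(corpus, n_pairs)      # randomly extracted normal texts
    names = sorted(embedders)                 # stable order for reproducibility
    pairs = []
    for cover in covers:
        name = rng.choice(names)              # randomly chosen algorithm
        bits = [rng.randint(0, 1) for _ in range(n_bits)]  # random bitstream
        pairs.append((cover, embedders[name](cover, bits), name))
    return pairs
```

Calling `build_pairs(corpus, {"mod_a": toy_embedder("mod_a"), "gen_a": toy_embedder("gen_a")}, n_pairs=3)` on a small list of sentences yields three cover-stego pairs, each labeled with the algorithm that produced it. Recording the algorithm name alongside each pair is a design choice of this sketch that would allow per-algorithm evaluation.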
