Fingerprinting Text in Logical Markup Languages

Information hiding is attracting increasing attention from the research community. Most of this research has centered on hiding information, such as watermarks and fingerprints, in images or in digital audio and video signals. Text has generally been treated as a black-and-white image with special properties. All current methods of hiding information in text are vulnerable to scanning followed by optical character recognition to reconstruct the text. Document distribution increasingly relies on logical markup languages such as HTML and XML, where the physical presentation of the text is determined by the user's browser. Embedding the watermark in the physical presentation of the document is therefore no longer practical. We argue that embedding syntactic or semantic fingerprints in the text itself is the only viable way to fingerprint documents in logical markup languages such as HTML or XML. In this paper, we propose a new semantic fingerprinting mechanism based on synonym substitution. This idea is developed into an operational system, and the results of preliminary experiments are reported.
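
To make the idea of synonym substitution concrete, the following is a minimal Python sketch and not the system described in the paper. It assumes a small, hypothetical table of interchangeable word pairs (`SYNONYMS`) in which the first word of each pair encodes bit 0 and the second encodes bit 1; the function names `embed` and `extract` are likewise illustrative. A real system would draw its candidates from a thesaurus or lexical database and would have to handle word sense, inflection, and capitalization, all of which this sketch ignores.

```python
import re

# Hypothetical synonym pairs (an assumption for illustration only):
# the word at index 0 encodes bit 0, the word at index 1 encodes bit 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

# Map every word to its pair and to the bit that the word encodes.
LOOKUP = {word: (pair, bit) for pair in SYNONYMS for bit, word in enumerate(pair)}


def embed(text, bits):
    """Rewrite substitutable words so that, in order, they spell out `bits`.

    Words outside the table are left untouched. Once the bits are used up,
    the remaining substitutable words are also left as they are.
    Replacements are emitted in lowercase; case is not preserved.
    """
    remaining = iter(bits)

    def swap(match):
        word = match.group(0)
        entry = LOOKUP.get(word.lower())
        if entry is None:
            return word
        bit = next(remaining, None)
        if bit is None:
            return word
        pair, _ = entry
        return pair[bit]

    return re.sub(r"[A-Za-z]+", swap, text)


def extract(text):
    """Read back one bit per substitutable word, in document order."""
    words = re.findall(r"[A-Za-z]+", text)
    return [LOOKUP[w.lower()][1] for w in words if w.lower() in LOOKUP]


fingerprint = [1, 0, 1]
marked = embed("a big dog made a quick start", fingerprint)
recovered = extract(marked)[:len(fingerprint)]
```

Because the fingerprint lives in the choice of words rather than in layout, it survives re-rendering by a browser and also retyping or OCR of the text. The receiver must know the fingerprint length, since `extract` returns a bit for every substitutable word it finds.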
