Statistically-constrained shallow text marking: techniques, evaluation paradigm and results

We present three natural language marking strategies based on fast and reliable shallow parsing techniques, and on widely available lexical resources: lexical substitution, adjective conjunction swaps, and relativiser switching. We test these techniques on a random sample of the British National Corpus. Individual candidate marks are checked for goodness of structural and semantic fit, using both lexical resources, and the web as a corpus. A representative sample of marks is given to 25 human judges to evaluate for acceptability and preservation of meaning. This establishes a correlation between corpus based felicity measures and perceived quality, and makes qualified predictions. Grammatical acceptability correlates with our automatic measure strongly (Pearson's r = 0.795, p = 0.001), allowing us to account for about two thirds of variability in human judgements. A moderate but statistically insignificant (Pearson's r = 0.422, p = 0.356) correlation is found with judgements of meaning preservation, indicating that the contextual window of five content words used for our automatic measure may need to be extended.

[1]  Eric Brill,et al.  A Simple Rule-Based Part of Speech Tagger , 1992, HLT.

[2]  Jun'ichi Tsujii,et al.  Probabilistic Disambiguation Models for Wide-Coverage HPSG Parsing , 2005, ACL.

[3]  Brian Murphy,et al.  Syntactic Information Hiding in Plain Text , 2001 .

[4]  Eric Brill,et al.  Some Advances in Transformation-Based Part of Speech Tagging , 1994, AAAI.

[5]  Adam Kilgarriff,et al.  Introduction to the Special Issue on the Web as Corpus , 2003, CL.

[6]  Dilek Z. Hakkani-Tür,et al.  Natural language watermarking: challenges in building a practical system , 2006, Electronic Imaging.

[7]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[8]  Frank Keller,et al.  Using the Web to Obtain Frequencies for Unseen Bigrams , 2003, CL.

[9]  Carl Vogel,et al.  The syntax of concealment: reliable methods for plain text information hiding , 2007, Electronic Imaging.

[10]  Lawrence O'Gorman,et al.  Electronic marking and identification techniques to discourage document copying , 1994, Proceedings of INFOCOM '94 Conference on Computer Communications.

[11]  J. Marks Longman Grammar of Spoken and Written English , 2001 .

[12]  Igor A. Bolshakov,et al.  A Method of Linguistic Steganography Based on Collocationally-Verified Synonymy , 2004, Information Hiding.

[13]  Edward J. Delp,et al.  Attacks on lexical natural language steganography systems , 2006, Electronic Imaging.

[14]  Radu Sion,et al.  Natural Language Watermarking and Tamperproofing , 2002, Information Hiding.

[15]  Mikhail J. Atallah,et al.  Natural Language Watermarking: Design, Analysis, and a Proof-of-Concept Implementation , 2001, Information Hiding.

[16]  飯島 周 「会話の文法」に関する一考察 : Longman Grammar of Spoken and Written Englishの場合 , 1999 .

[17]  Carl Vogel,et al.  Cross-linguistic empirical analysis of constraints on passive , 2006 .

[18]  Xingming Sun,et al.  Component-based digital watermarking of Chinese texts , 2004, InfoSecu '04.