A natural language approach to automated cryptanalysis of two-time pads

While keystream reuse in stream ciphers and one-time pads has been a well-known problem for several decades, the risk to real systems has been underappreciated. Previous techniques have relied on accurately guessing words and phrases that appear in one of the plaintext messages, making it far easier to claim that "an attacker would never be able to do that." In this paper, we show how an adversary can automatically recover messages encrypted under the same keystream if only the type of each message is known (e.g., an HTML page in English). Our method, which is related to hidden Markov models (HMMs), recovers the most probable plaintext of this type by using a statistical language model and a dynamic programming algorithm. It achieves up to 99% accuracy on realistic data and can process ciphertexts at 200 ms per byte on a $2,000 PC. To further demonstrate the practical effectiveness of the method, we show that our tool can recover documents encrypted by Microsoft Word 2002 [22].
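The idea can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes a character bigram model trained on a tiny made-up corpus and a Viterbi-style search; the names `xor_bytes`, `train_bigram`, and `recover`, the corpus, the keystream, and the smoothing constant are all hypothetical. The sketch first shows that XORing two ciphertexts produced with the same keystream cancels the keystream, then searches for the most probable plaintext pair consistent with that XOR.

```python
import math
from collections import defaultdict


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def train_bigram(corpus: bytes, alphabet: bytes, alpha: float = 0.1):
    """Return logp(prev, cur) for a bigram model with add-alpha smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for prev, cur in zip(corpus, corpus[1:]):
        counts[prev][cur] += 1
        totals[prev] += 1
    size = len(set(alphabet))

    def logp(prev: int, cur: int) -> float:
        return math.log((counts[prev][cur] + alpha) / (totals[prev] + alpha * size))

    return logp


def recover(x: bytes, logp, alphabet: bytes):
    """Most probable (p1, p2) with p1 XOR p2 == x under the toy model.

    The state at position i is the byte p1[i]; p2[i] is then forced to be
    p1[i] ^ x[i]. Both streams are scored with the same bigram model.
    """
    if not x:
        return b"", b""
    allowed = set(alphabet)
    layers = []  # layers[i][a] = (best score ending in a, previous state)
    prev = None
    for i, xi in enumerate(x):
        cur = {}
        for a in sorted(allowed):
            b = a ^ xi
            if b not in allowed:
                continue
            if prev is None:
                cur[a] = (0.0, None)
            else:
                cur[a] = max(
                    ((s + logp(pa, a) + logp(pa ^ x[i - 1], b), pa)
                     for pa, (s, _) in prev.items()),
                    key=lambda t: t[0],
                )
        if not cur:
            raise ValueError("no plaintext pair in the alphabet matches x")
        layers.append(cur)
        prev = cur
    # Backtrack from the best final state.
    a = max(prev, key=lambda k: prev[k][0])
    p1 = bytearray()
    for i in range(len(x) - 1, -1, -1):
        p1.append(a)
        a = layers[i][a][1]
    p1.reverse()
    return bytes(p1), xor_bytes(bytes(p1), x)


# Hypothetical toy data: two plaintexts encrypted under one reused keystream.
corpus = b"the quick brown fox jumps over the lazy dog and the cat sat on the mat "
alphabet = bytes(sorted(set(corpus)))
logp = train_bigram(corpus, alphabet)

p1 = b"the cat sat on"
p2 = b"over the lazy "
keystream = bytes((37 * i + 11) % 256 for i in range(len(p1)))
c1, c2 = xor_bytes(p1, keystream), xor_bytes(p2, keystream)

x = xor_bytes(c1, c2)          # the keystream cancels: x == p1 XOR p2
g1, g2 = recover(x, logp, alphabet)
print(g1, g2)
```

With such a small model the guesses need not match the true plaintexts, and the pair can come out in either order because both streams share one model. The sketch only guarantees that the returned pair is consistent with the observed XOR.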

[1] L. Baum, et al. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process, 1972.

[2] D. Rubin, et al. Maximum likelihood from incomplete data via the EM-algorithm plus discussions on the paper, 1977.

[3] Frank Rubin. Computer Methods for Decrypting Random Stream Ciphers, 1978, Cryptologia.

[4] Jeffrey D. Ullman, et al. Introduction to Automata Theory, Languages and Computation, 1979.

[5] Roger K. Moore. Computer Speech and Language, 1986.

[6] Lawrence R. Rabiner, et al. A tutorial on hidden Markov models and selected applications in speech recognition, 1989, Proc. IEEE.

[7] Bernard P. Zajac. Applied cryptography: Protocols, algorithms, and source code in C, 1994.

[8] Ed Dawson, et al. Automated Cryptanalysis of XOR Plaintext Strings, 1996, Cryptologia.

[9] Michael Warner, et al. Venona: Soviet espionage and the American response 1939-1957, 1997.

[10] Bruce Schneier, et al. Cryptanalysis of Microsoft's PPTP Authentication Extensions (MS-CHAPv2), 1999, CQRE.

[11] Morris J. Dworkin, et al. Recommendation for Block Cipher Modes of Operation: Methods and Techniques, 2001.

[12] Dawn Xiaodong Song, et al. Timing Analysis of Keystrokes and Timing Attacks on SSH, 2001, USENIX Security Symposium.

[13] David A. Wagner, et al. Intercepting mobile communications: the insecurity of 802.11, 2001, MobiCom '01.

[14] Morris J. Dworkin, et al. SP 800-38A 2001 edition. Recommendation for Block Cipher Modes of Operation: Methods and Techniques, 2001.

[15] Dar-Shyang Lee, et al. Substitution Deciphering Based on HMMs with Applications to Compressed Document Processing, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[16] Sanjeev Khudanpur, et al. Contemporaneous text as side-information in statistical language modeling, 2004, Comput. Speech Lang.

[17] Tadayoshi Kohno, et al. Attacking and repairing the WinZip encryption scheme, 2004, CCS '04.

[18] Hongjun Wu. The Misuse of RC4 in Microsoft Word and Excel, 2005, IACR Cryptol. ePrint Arch.

[19] Bob Carpenter, et al. Scaling High-Order Character Language Models to Gigabytes, 2005, ACL 2005.

[20] Vitaly Shmatikov, et al. Fast dictionary attacks on passwords using time-space tradeoff, 2005, CCS '05.

[21] Feng Zhou, et al. Keyboard acoustic emanations revisited, 2005, CCS '05.

[22] Sang Joon Kim, et al. A Mathematical Theory of Communication, 2006.