Data Compression and Encryption Algorithms

The algorithms for balancing splay-trees, a form of self-adjusting binary search tree invented by Dan Sleator and analyzed by Bob Tarjan, can be adapted to the job of balancing the trie used within a prefix code. This was reported in the paper Applications of Splay Trees to Data Compression by Douglas W. Jones in Communications of the ACM, Aug. 1988, pages 996-1007.

Simple splay-tree based compression.
Simple splay-tree based uncompression.
A UNIX splay-tree based compression utility.

The final piece of code listed above is roughly comparable to the commonly used UNIX compress and uncompress utilities based on LZW compression. The user interface is largely compatable, and it compresses about as well, although not quite as quickly.

The splay-compression utility is actually a subset of a more interesting utility that contained some additional cryptographic features. That code cannot (as far as I know) be exported from the United States. The basic idea of the cryptographic features (an idea used by Lotus in their product Ami Pro) is to condition the initial code tree used for data compression by the successive characters of the encryption key before use.

The Lotus Ami Pro version of the algorithm is insecure (deliberately, in order to allow export). In addition to conditioning the initial code tree with the password, they also include the password (encoded as a string of two digit decimal numbers) as a prefix on the data to be compressed and encrypted. As a result, if the password is purely alphabetic (and therefore doesn't mess up the encoding of the digits), the decimal form of the password can be inferred fairly easily. The situation is modestly more complex if the password is partially numeric. The weaknesses (and details) of the Lotus Ami Pro encryption scheme were reported to me by Paul C. Kocher (pck@netcom.com).

The weakness of Lotus's approach can be patched by removing the password from the head of the string being encrypted, and replacing it by a randomly chosen prefix. This makes known plaintext attacks difficult by using the random prefix to "stir up the tree" prior to the occurance of any known data.

Splay encryption is vulnerable to a chosen plaintext attacks! There are sequences of bytes that will expose the current state of the tree to someone who examines the compressed data, thus allowing any following data to be decoded. It may be possible to patch this weakness by a combination of run-length encoding and the insertion of random data at periodic points in the string to be encrypted, where these points must be spaced sufficiently closely to break the chosen plaintext string into short segments. More detail on thiese suggestions is available.

None of the comments about patching the weakness of splay-tree based encryption should be taken as any assurance that the patched result is very secure! It is only an accident that splay trees are useful for encryption, and the security of variants on the algorithm is still largely an open question.

A full version of the splay-tree compression program, including the cryptographic features stripped from the version released over the Web, is available by E-mail from the author.

The comp.compression FAQ is a good general source of references on this entire subject area. There is also an excellent list of compression pointers on the web, sorted by author and by substance.

In the spring of 1993, I taught 22C:16, Introduction to Programming in Pascal. Every semester I've taught this course, I try to find a "theme" around which to organize my programming assignments. For the spring of 1993, I decided to use some very elementary cryptography as the subject matter for programming assignments.

Machine problem two -- elementary Caesar cyphers (rot13).
- The assignment
- The solution distributed to students
- The test data (cyphertext)
- The assignment
- The solution distributed to students
- The test data (cyphertext)
- The key used to create the cyphertext
- The assignment
- The solution distributed to students
- The letter frequency data
- test data (cyphertext)
- test data (cyphertext)
- test data (cyphertext)
- The assignment
- Unfortunately, I have lost the solution I distributed
- The assignment
- The solution distributed to students
- The codebook for the sample text
- The sample text (in encoded form)