A key question is, what's a good metric? I knew that the ideal metric would take into account how the LZ77 algorithm works, but for expediency, I went with a reasonable and well known string distance: the levenshtein distance.
GZip vs 7-zip
In digging around for a Java 7-zip deflate implementation, I ran into this paper on Paul Sladen's page. titled: Compressing File Collections with a TSP-Based Approach. This paper published in 2004 details a similar technique of enhancing TAR compression of a collection of files by re-ordering the TAR to maximize compression. More on that later, but the paper itself has an obscure reference to something called the LZ Distance metric. Tracking down that paper "G. Cormode, M. Paterson, S. Sahinalp, and U. Vishkin. Communication complexity of doc- ument exchange. In Proc. of the ACM–SIAM Symp. on Discrete Algorithms, January 2000." leads us to the definition of LZ Distance.
The LZ Distance between two strings x and y denoted by LZDIST(x,y) is as follows:
The minimum number of single characters or substrings of y or of the partially built string, which are required to build x from left to right.
As an example, consider x=aabcdefgabc and y=aaaaaaaaaaaaaaaa. We try to parse x in terms of y, and keep a running partially built string called 'p'. Here are the steps:
- Prefix 'aa' of x is contained in y, so p = (aa), x = bcdefgabc
- No prefix of x found in y or p, so just add 'b' p = (aa)(b), x = cdefgabc
- No prefix of x found in y or p ,so just add 'c', p = (aa)(b)(c), x = defgabc
- No prefix of x found in y or p ,so just add 'd', p = (aa)(b)(c)(d), x = efgabc
- No prefix of x found in y or p ,so just add 'e', p = (aa)(b)(c)(d)(e), x = fgabc
- No prefix of x found in y or p ,so just add 'f', p = (aa)(b)(c)(d)(e)(f), x = gabc
- No prefix of x found in y or p ,so just add 'g', p = (aa)(b)(c)(d)(e)(f)(g), x = abc
- 'abc' is found at index '1' in p, so p = (aa)(b)(c)(d)(e)(f)(g)(abc)
Here, the LZDIST(x,y) = 8. But what of LZDIST(y, x)?
- Prefix 'aa' of y is contained in x, so p = (aa), y = aaaaaaaaaaaaaa
- Prefix 'aa' of y is contained in x, so p = (aa)(aa), y = aaaaaaaaaaaa
- Prefix 'aaaa' of y is contained in p, p = (aa)((aa)(aaaa) y = aaaaaaaa
- Prefix 'aaaaaaaa' of y is contained in p, so p = (aa)(aa)(aaaa)(aaaaaaaa)
LZDIST(y,x) = 4. Interesting, but what does this buy us?
LZDIST and GZIP
Recall that GZIP's deflate encoder uses an algorithm called LZ77 which looks at the current input tokens, and searches backwards for previous substrings that match the current input string, replacing them with back references. What LZDIST(x,y) gives us, is a measure of how many back references for X can be found in Y, that is, how X can be compressed by references to substrings in Y or in partially built decompression outputs.
Clustering with LZDIST
Remember the earlier paper's title had the acronym TSP? This paper discusses how re-ordering can be viewed as an optimal tour on a graph which means that solving it is as hard as the Traveling Salesman Problem - NP-Complete. Uh oh, that sounds like trouble, but that's only NP Complete for the optimal solution. A good estimate on the true solution might also produce good results, and this paper shows that a greedy tour heuristic produces up to a 10% benefit in compression.
For expediency reasons, I decided not to implement a TSP model at this point, but went with simple pair-based greedy approach detailed in my previous blog post, substituting LZDIST for Levenshtein Distance. I also implemented an LZ-like greedy algorithm where I keep a running string which is the concatenation of all of the functions which have been added to the output, and use LZDIST to find the next closest function to append within a 32k window. This mimics how a gzip compressor might see the input.
Results for GWT Showcase Application
|Technique||Size||Ratio||Relative Improvement (vs previous row)|
|gzip -9 + levensthein||61409||31.2%||4.7%|
|7zip + levenshtein||58536||29.7%||4.7%|
|7zip + lzdist||57923||29.4%||1.1%|
Looking at relative compression, levenshtein vs lzdist, the difference 57923/58536 = 1.1% which seems worthwhile to me. In a large script, this can save 1-2 kilobytes. I threw in LZMA which boasts 12% better compression in this instance than DEFLATE.
Notes on implementation
The LZDIST function is expensive to compute in a naive fashion. It is similar to the longest common substring problem. The brute force version can be on the order of O(n^3), while a dynamic programming version can run in O(n^2). To make this run efficiently and not irritate your users, you need to use a Generalized Suffix Tree approach. Compressors like 7-zip are already using Suffix trees to implement the LZ77 search algorithm, so those of you looking for hints to do an efficient implementation should take a look at the LZMA SDK (both C++ and Java versions are available)
Utilizing clustering compression techniques, LZ distance, and 7-zip, it is possible to reduce the GWT Showcase application from 64401 bytes to 57923 bytes, a savings of about 10.1%.