Re: Zip v8, stuff on compression?


Thu, 23 Nov 1995 17:50:17 -0500 (EST)

The presence of anything other than lowercase letters (upper case,
punctuation, and the various control symbols frequently found in strings)
appears to significantly lower the efficiency of the algorithm, which does
not even take them into account. You could play around with the letter
tables to maybe up the efficiency some (how much I've nary a clue).
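
For anyone who wants to try tuning the tables, here is a rough sketch of one
way to rebuild the primary list from a sample of your own game text. The name
build_primary and its interface are made up for illustration and are not part
of the posted program; it simply picks the most frequent 7-bit characters.

```c
/* Hypothetical helper, not part of the posted program: fill `table` with
 * the `n` most frequent 7-bit characters found in `sample`, most frequent
 * first.  Ties go to the lower character code.  `table` must have room
 * for n + 1 bytes. */
void build_primary(const char *sample, char *table, int n)
{
    unsigned long count[128] = {0};
    const unsigned char *p;
    int i, c, best;

    /* tally only 7-bit characters, since the top bit is the pair marker */
    for (p = (const unsigned char *)sample; *p != '\0'; ++p)
        if (*p < 128)
            ++count[*p];

    /* repeatedly take the most frequent character not yet chosen */
    for (i = 0; i < n; ++i) {
        best = -1;
        for (c = 1; c < 128; ++c)
            if (count[c] > 0 && (best < 0 || count[c] > count[best]))
                best = c;
        if (best < 0)
            break;              /* fewer than n distinct characters */
        table[i] = (char)best;
        count[best] = 0;        /* do not pick it again */
    }
    table[i] = '\0';
}
```

Calling it with n = 16 gives a candidate primary list. The secondary lists
could presumably be built the same way by counting which characters follow
each primary character, though whether that beats the hand-made tables is
something I have not measured.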

So in terms of compression, this method is comparable to Radix-40, but
it does not at first hack appear to be any more efficient, which means
the effort of retrofitting the various and sundry Inform compilers and
interpreters out there would not be worth it, especially considering the
volume of game files available which use the present compression scheme.

Still, IMHO it is a slightly better compression algorithm from a coding
point of view, since it keeps all compression within a single byte at a
time, as opposed to Radix-40 which tries to compress three chars across
a 16-bit word, which is somewhat messy from an aesthetic point of view
(not that hackers like myself are ever much concerned with aesthetics).
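
For comparison, this is roughly what packing three characters into one 16-bit
word looks like with a plain base-40 scheme. It is only a sketch: the
40-symbol alphabet and the rad40_* names are assumptions made up for
illustration, and the format actually used in Inform game files may well
differ in detail.

```c
#include <string.h>

/* Assumed 40-symbol alphabet, for illustration only; the index in this
 * string is the symbol's code (space = 0, 'a' = 1, and so on). */
static const char rad40_set[] = " abcdefghijklmnopqrstuvwxyz0123456789.,-";

/* Return the code (0..39) of c, or -1 if c is not in the assumed set. */
int rad40_code(char c)
{
    const char *p = (c != '\0') ? strchr(rad40_set, c) : NULL;
    return p ? (int)(p - rad40_set) : -1;
}

/* Pack three symbols into one word: c0*40*40 + c1*40 + c2.
 * The largest value is 39*1600 + 39*40 + 39 = 63999, which fits in
 * 16 bits.  Returns -1 if any symbol is outside the set. */
long rad40_pack(const char s[3])
{
    int a = rad40_code(s[0]), b = rad40_code(s[1]), c = rad40_code(s[2]);

    if (a < 0 || b < 0 || c < 0)
        return -1;
    return (long)a * 1600 + b * 40 + c;
}

/* Unpack a word produced by rad40_pack into three characters. */
void rad40_unpack(unsigned word, char out[3])
{
    out[0] = rad40_set[word / 1600];
    out[1] = rad40_set[(word / 40) % 40];
    out[2] = rad40_set[word % 40];
}
```

Every character boundary here falls in the middle of a multiply or divide,
which is the messiness being complained about; the byte-at-a-time scheme
needs only a shift and a mask.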

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
/*

Original algorithm proposed by "Dark Fiber".
C Source hack by Jeff Standish, jestandi@cs.indiana.edu, 23-Nov-95.
All rights thrown to the winds.
Use this code at your own risk.
So if it eats your hard drive, it's not my damned fault!

Following comments from article posted to rec.arts.int-fiction:
Re: Zip v8, stuff on compression?
Dark Fiber <entropy@I_should_put_my_domain_in_etc_NNTP_INEWS_DOMAIN> wrote:
>
> The technique only works for text, which suits ZIP down to the ground, and
> it pretty much guarantees you a ratio of 2:1.
>
> how it works is that because we are dealing with plain ascii text, the top
> bit is free, this we use as our compressed marker bit, then we split the
> next 7 bits up into a primary character of 4 bits, and a secondary
> character of 3 bits.
>
> It works on the fact that you have your 16 most common characters, one of
> them being a space, and then you have the 8 most common characters that
> follow each of the 16 main characters.
>
> You have a translation library of 256 characters, and this is the library
> i have been using for years and years
>
> Altering this table will alter the level of compression. Just by
> looking at the table it is plain to see its limitations, but it would
> produce much greater compression than what's currently in ZIP.
>
> okay, back to the tech stuff,
>
> two compressed characters are encoded in one 8bit character laid out as
>
> 7 6543 210
> 1          - Marker bit
>   1111     - Primary character
>        111 - Secondary character
*/

#include <stdio.h>
#include <string.h>

/* prototypes; "letter" is an int so that it survives argument promotion */
int compress_string(char *strg, char *pressed, int len);
int index_of(int letter, char *strg, int len);
int uncompress_string(char *strg, char *unpress, int len);

char *primary = " aeioustdbglrnhc";
char *secondary[] = {
    " eorsgnt",    /* SPACE */
    " ntrgvsl",    /* A */
    " aorsdnl",    /* E */
    "ensgltcd",    /* I */
    " ounrfsl",    /* O */
    " ernlstc",    /* U */
    " aeiortp",    /* S */
    " aeiorhc",    /* T */
    " aeior,.",    /* D */
    "raeiou.,",    /* B */
    " aeohrls",    /* G */
    " aeiolyd",    /* L */
    " eaiodsy",    /* R */
    " aeiodtg",    /* N */
    " aeioltr",    /* H */
    "laeomrst" };  /* C */

int main(void)
{
    char instrg[256], compstrg[256], outstrg[256];
    int inlen, complen, outlen;

    /* get input string; unlike gets, fgets cannot overrun the buffer */
    printf("Input: ");
    if (fgets(instrg, sizeof(instrg), stdin) == NULL)
        return (1);
    instrg[strcspn(instrg, "\n")] = '\0';   /* drop the trailing newline */

    /* get length of string */
    inlen = (int) strlen(instrg);
    printf("Input length: %d\n", inlen);

    /* compress the string */
    complen = compress_string(instrg, compstrg, inlen);
    printf("Compressed length: %d\n", complen);

    /* uncompress the string */
    outlen = uncompress_string(compstrg, outstrg, complen);
    printf("Uncompressed length: %d\n", outlen);

    /* make sure compression and uncompression worked right */
    if (strcmp(instrg, outstrg))
        printf("Error!  The uncompressed string does not match original!\n");
    printf("Output: \"%s\"\n", outstrg);

    return (0);
}

/* given a string and its length, compress the string and
* return the length of the compressed string
*/
int compress_string(char *strg, char *pressed, int len)
{
    int index, pdex, primidx, secidx;

    index = 0;
    pdex = 0;
    while (index < len) {
        primidx = index_of(strg[index], primary, 16);

        /* if this letter is not in the primary list, then just copy
           it into the compressed string */
        if (primidx < 0)
            pressed[pdex] = strg[index];

        /* otherwise it is in the primary list, so check to see if the
           following letter is from the appropriate secondary list
           (at the last letter this looks at the terminating '\0',
           which is never in a secondary list) */
        else {
            secidx = index_of(strg[index+1], secondary[primidx], 8);

            /* if following letter is not in secondary list, then just
               copy the first letter into the compressed string and
               do nothing special */
            if (secidx < 0)
                pressed[pdex] = strg[index];

            /* otherwise it is in the secondary list, so compress both
               letters into a single byte: marker in bit 7, primary
               index in bits 6-3, secondary index in bits 2-0 */
            else {
                pressed[pdex] = (char) (0x80 | (primidx << 3) | secidx);
                ++index;
            }
        }
        ++index;
        ++pdex;
    }

    /* mark the end of the compressed string */
    pressed[pdex] = '\0';

    return (pdex);
}

/* given a letter, a string, and the length of the string,
 * return the offset of the letter in the string,
 * or return -1 if not found in the string
 */
int index_of(int letter, char *strg, int len)
{
    int index;

    for (index = 0; index < len; ++index)
        if (letter == strg[index])
            return (index);

    return (-1);
}

/* given a compressed string and the length of the string, uncompress it and
* return the length of the uncompressed string
*/
int uncompress_string(char *strg, char *unpress, int len)
{
    char *sptr;
    unsigned char byte;
    int index, udex, primidx, secidx;

    udex = 0;
    for (index = 0; index < len; ++index) {

        /* read the byte as unsigned so the mask and shift below do not
           depend on whether plain char is signed */
        byte = (unsigned char) strg[index];

        /* if this byte contains a compressed letter pair, grab the
           primary and secondary indices from the byte, look up the
           letter pair, and append the two letters to the end of the
           uncompressed string */
        if (byte & 0x80) {
            primidx = (byte >> 3) & 0x0f;
            secidx = byte & 0x07;
            sptr = secondary[primidx];
            unpress[udex++] = primary[primidx];
            unpress[udex] = sptr[secidx];
        }

        /* otherwise the byte contains a normal letter, so add it to
           the end of the uncompressed string */
        else
            unpress[udex] = strg[index];

        ++udex;
    }

    /* mark the end of the uncompressed string */
    unpress[udex] = '\0';

    return (udex);
}
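
To make the byte layout concrete, here is a small stand-alone sketch of just
the packing step. The helper names (pack_pair and friends) are made up for
illustration and do not appear in the program above; the bit positions follow
the layout quoted from the original article.

```c
/* Build a pair byte: marker in bit 7, primary index (0..15) in bits 6-3,
 * secondary index (0..7) in bits 2-0. */
unsigned char pack_pair(int primidx, int secidx)
{
    return (unsigned char)(0x80 | ((primidx & 0x0f) << 3) | (secidx & 0x07));
}

/* Nonzero if the byte holds a compressed pair rather than a plain letter. */
int is_pair(unsigned char byte)
{
    return (byte & 0x80) != 0;
}

/* Recover the two table indices from a pair byte. */
int pair_primary(unsigned char byte)
{
    return (byte >> 3) & 0x0f;
}

int pair_secondary(unsigned char byte)
{
    return byte & 0x07;
}
```

With the tables as posted, 't' is entry 7 of the primary list and 'h' is
entry 6 of the T secondary list, so "th" packs to pack_pair(7, 6), which is
0xBE. Tracing the same tables by hand, "the cat sat" (11 characters) should
come out as 6 bytes: th, e-space, ca, t-space, sa, and a lone t.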

-- 
----------------------------------------------------------------------------
Jeff Standish                                      The concept is simply
jestandi@cs.indiana.edu                            staggering.  Pointless,
http://www.cs.indiana.edu/hyplan/jestandi.html     but staggering.  -Dr. Who
----------------------------------------------------------------------------