The compression in ZIP is shithouse. It's practically non-existent, and I
feel it needs to be replaced with something better. Obviously techniques
such as LZW and Huffman are not worthwhile, as each string is compressed
individually, and RLE encoding will gain us nothing.
I don't know the name of the technique or who first created it, but I've
been using it for years in my programs, such as Text2Exe, my disk
magazines and some other programs.
I never seem to explain things well, so... :) And I only program in
assembler, so that's what my examples will be in, but it should not
be too hard to translate them into C.
The technique only works for text, which suits ZIP down to the ground, and
it can give you a ratio of up to 2:1. You only hit the full 2:1 when every
pair of characters is in the table; anything else is stored as-is.
How it works: because we are dealing with plain ASCII text, the top bit of
every byte is free. We use that as our "compressed" marker bit, then split
the remaining 7 bits into a primary character index of 4 bits and a
secondary character index of 3 bits.
It works on the fact that you have your 16 most common characters, one of
them being a space, and then you have the 8 most common characters that
follow each of the 16 main characters.
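A rough C sketch of how the 16 primary characters and their 8 followers each could be picked from a sample of text. The names build_tables and top_n are made up for this example, and the tie-breaking rule (lower character code wins) is an assumption; nothing here says how the hand-made table in this post was actually chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Write the `want` characters with the highest counts to out.
   Ties go to the lower character code (an assumption of this sketch). */
static void top_n(const unsigned long counts[128], int want, char *out)
{
    int taken[128] = {0};
    assert(want <= 128);
    for (int k = 0; k < want; k++) {
        int best = -1;
        for (int ch = 0; ch < 128; ch++)
            if (!taken[ch] && (best < 0 || counts[ch] > counts[best]))
                best = ch;
        taken[best] = 1;
        out[k] = (char)best;
    }
}

/* Fill primary[16] with the 16 most common characters in text, and
   secondary[r*8 .. r*8+7] with the 8 most common followers of primary[r].
   High-bit characters are folded to 7 bits here; a real tool would
   probably reject them instead. */
static void build_tables(const char *text, size_t n,
                         char primary[16], char secondary[16 * 8])
{
    static unsigned long single[128];
    static unsigned long follow[128][128];

    memset(single, 0, sizeof single);
    memset(follow, 0, sizeof follow);
    for (size_t i = 0; i < n; i++) {
        unsigned a = (unsigned char)text[i] & 0x7Fu;
        single[a]++;
        if (i + 1 < n)
            follow[a][(unsigned char)text[i + 1] & 0x7Fu]++;
    }
    top_n(single, 16, primary);
    for (int r = 0; r < 16; r++)
        top_n(follow[(unsigned char)primary[r]], 8, secondary + r * 8);
}
```

With a very short sample, the unused slots fill up with characters that never occurred, so in practice you would feed it a decent chunk of representative text.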
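You have a translation library of 16 primary characters plus 16 x 8 = 128
secondary characters (144 bytes in all, covering 128 two-character pairs),
and this is the library I have been using for years and years
(remember, it's straight from my ASM source):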
Primary_Library:   db " aeioustdbglrnhc"
Secondary_Library: db " eorsgnt"  ;SPACE
                   db " ntrgvsl"  ;A
                   db " aorsdnl"  ;E
                   db "ensgltcd"  ;I
                   db " ounrfsl"  ;O
                   db " ernlstc"  ;U
                   db " aeiortp"  ;S
                   db " aeiorhc"  ;T
                   db " aeior,."  ;D
                   db "raeiou.,"  ;B
                   db " aeohrls"  ;G
                   db " aeiolyd"  ;L
                   db " eaiodsy"  ;R
                   db " aeiodtg"  ;N
                   db " aeioltr"  ;H
                   db "laeomrst"  ;C
Altering this table will alter the level of compression. Just by
looking at the table it's plain to see its limitations, but it would produce
much greater compression than what's currently in ZIP.
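The post only gives the decompressor, so here is a rough C sketch of what the matching compressor might look like. It assumes the input is 7-bit ASCII and pairs characters greedily from left to right; the names pair_compress, primary_lib and secondary_lib are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static const char primary_lib[] = " aeioustdbglrnhc";   /* 16 entries */
static const char secondary_lib[] =                     /* 16 rows of 8 */
    " eorsgnt" " ntrgvsl" " aorsdnl" "ensgltcd"   /* ' ', a, e, i */
    " ounrfsl" " ernlstc" " aeiortp" " aeiorhc"   /* o, u, s, t   */
    " aeior,." "raeiou.," " aeohrls" " aeiolyd"   /* d, b, g, l   */
    " eaiodsy" " aeiodtg" " aeioltr" "laeomrst";  /* r, n, h, c   */

/* Compress n bytes of 7-bit text from src into dst (dst needs room for
   n bytes in the worst case). Returns the number of bytes written. */
static size_t pair_compress(const char *src, size_t n, uint8_t *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        const char *p = memchr(primary_lib, src[i], 16);
        if (p != NULL && i + 1 < n) {
            size_t row = (size_t)(p - primary_lib);
            const char *row_start = secondary_lib + row * 8;
            const char *s = memchr(row_start, src[i + 1], 8);
            if (s != NULL) {
                size_t col = (size_t)(s - row_start);
                /* marker bit | 4-bit primary | 3-bit secondary */
                dst[out++] = (uint8_t)(0x80u | (row << 3) | col);
                i++;                /* the pair used up two characters */
                continue;
            }
        }
        /* Literal byte; the caller must make sure the top bit is clear. */
        dst[out++] = (uint8_t)src[i];
    }
    return out;
}
```

With this table, the 7 characters of "the cat" come out as 4 bytes: the pairs "th", "e " and "ca" take one byte each, and the final "t" is stored as a literal.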
OK, back to the tech stuff.
Two compressed characters are encoded in one 8-bit character, laid out as

    7 6543 210
    1             - Marker bit
      1111        - Primary character
           111    - Secondary character
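In C, the bit layout above boils down to a few shifts and masks. This is only a sketch; the helper names (pack_pair, is_packed, primary_of, secondary_of) are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a primary index (0..15) and a secondary index (0..7) into one byte:
   bit 7 = marker, bits 6..3 = primary, bits 2..0 = secondary. */
static uint8_t pack_pair(unsigned primary, unsigned secondary)
{
    assert(primary < 16 && secondary < 8);
    return (uint8_t)(0x80u | (primary << 3) | secondary);
}

/* Non-zero if the byte carries a compressed pair (top bit set). */
static int is_packed(uint8_t b)
{
    return (b & 0x80u) != 0;
}

/* Pull the two table indexes back out of a packed byte. */
static unsigned primary_of(uint8_t b)
{
    return (b >> 3) & 0x0Fu;
}

static unsigned secondary_of(uint8_t b)
{
    return b & 0x07u;
}
```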
So, in order to decompress a compressed character
(again, this is straight from my asm code):
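        xor     ax,ax                   ;ah = 0 in case it's a literal
        lodsb                           ;al = next byte from ds:si
        test    al,080h                 ;marker bit set?
        jz      NotCompressed           ;no, al is a plain character
        mov     ah,al
        and     ax,07807h               ;ah = primary * 8, al = secondary
        xor     bx,bx
        mov     bl,ah
        shr     bx,3                    ;bx = primary index (needs a 186+)
        mov     cl,byte ptr [bx + offset Primary_Library]
        xor     bx,bx
        mov     bl,ah                   ;bx = primary * 8 = row offset
        cbw                             ;ax = secondary index (clears ah)
        add     bx,ax
        mov     ch,byte ptr [bx + offset Secondary_Library]
        xchg    cx,ax                   ;al = first, ah = second
NotCompressed:
        ;al = first character
        ;ah = second character. If ah = 0, there is no second character.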
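For anyone who doesn't read assembler, here is a rough C translation of the routine above. It is a sketch, not a drop-in replacement: the function name pair_decode_byte is made up, and the two tables are passed in as pointers (16 characters and 16 x 8 characters, laid out as in the library above) rather than referenced by label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one input byte. Writes one or two characters to out and returns
   how many were written (1 = literal, 2 = compressed pair). */
static int pair_decode_byte(uint8_t b, const char *primary,
                            const char *secondary, char out[2])
{
    assert(primary != NULL && secondary != NULL);

    if ((b & 0x80u) == 0) {         /* test al,080h / jz NotCompressed */
        out[0] = (char)b;
        return 1;
    }

    unsigned hi = b & 0x78u;        /* primary index, still shifted left 3 */
    unsigned lo = b & 0x07u;        /* secondary index */

    out[0] = primary[hi >> 3];
    out[1] = secondary[hi + lo];    /* hi is already primary * 8 */
    return 2;
}
```

The asm version signals "no second character" with ah = 0; the C version returns a count instead, which is an arbitrary choice made for this sketch.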
If you don't understand that, then, um... I said I have trouble
explaining what I mean. Sorry.
Hmmmm. Oh well.
It's just a thought...
--+--
Dark Fiber [NuKE] | Entropy, killing the WWW bandwidth stealing
a.k.a. Entropy | wankers in 1995.
entropy@shell.break.com.au | [NuKE] 1995 world tour. Beware the Genome! .01a
--+--