First up, let me say I don't like writing in assembler.  It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes and since
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions
des_encrypt(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.   The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimisation for when doing triple DES) and des_encrypt3() and des_decrypt3()
perform triple DES.  The triple DES routines are in here because it does
make a big difference to have them located near the des_encrypt2 function
at link time.
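
As a rough illustration of why skipping IP() and FP() between the triple DES
stages is safe, here is a minimal C sketch.  Everything in it is a toy
stand-in and not libdes code: toy_ip()/toy_fp() are an arbitrary invertible
pair, toy_encrypt2() is a 4-round Feistel with a made-up round function, and
the encrypt/decrypt/encrypt ordering of the three stages is an assumption
about how des_encrypt3() is put together.

```c
#include <stdint.h>

typedef uint32_t WORD;   /* hypothetical stand-in for DES_LONG */

/* Toy replacements for IP()/FP(): any pair that undo each other will do. */
static void toy_ip(WORD d[2]) { WORD t = d[0]; d[0] = d[1] ^ 0xA5A5A5A5u; d[1] = t; }
static void toy_fp(WORD d[2]) { WORD t = d[1]; d[1] = d[0] ^ 0xA5A5A5A5u; d[0] = t; }

/* Made-up round function; NOT the DES S-box/permutation step. */
static WORD toy_f(WORD r, WORD k) { WORD x = (r ^ k) * 2654435761u; return x ^ (x >> 15); }

/* Plays the role of des_encrypt2(): the rounds only, no IP/FP. */
static void toy_encrypt2(WORD d[2], const WORD ks[4], int encrypt)
{
    WORD l = d[0], r = d[1];
    for (int i = 0; i < 4; i++) {
        WORD k = encrypt ? ks[i] : ks[3 - i];   /* decrypt = reversed keys */
        WORD t = l ^ toy_f(r, k);
        l = r;
        r = t;
    }
    d[0] = r;   /* undo the last swap so decrypt mirrors encrypt */
    d[1] = l;
}

/* Plays the role of des_encrypt(): IP, rounds, FP. */
static void toy_encrypt(WORD d[2], const WORD ks[4], int encrypt)
{
    toy_ip(d);
    toy_encrypt2(d, ks, encrypt);
    toy_fp(d);
}

/* Triple version: one IP, three passes over the rounds, one FP. */
static void toy_encrypt3(WORD d[2], const WORD k1[4], const WORD k2[4], const WORD k3[4])
{
    toy_ip(d);
    toy_encrypt2(d, k1, 1);
    toy_encrypt2(d, k2, 0);
    toy_encrypt2(d, k3, 1);
    toy_fp(d);
}

static void toy_decrypt3(WORD d[2], const WORD k1[4], const WORD k2[4], const WORD k3[4])
{
    toy_ip(d);
    toy_encrypt2(d, k3, 0);
    toy_encrypt2(d, k2, 1);
    toy_encrypt2(d, k1, 0);
    toy_fp(d);
}
```

Because FP() followed by IP() cancels out, toy_encrypt3() gives the same
answer as three back-to-back toy_encrypt() calls while doing 4 fewer
permutations per block.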

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as other people's.
The 4 main formats I know of are
Microsoft	Windows 95/Windows NT
Elf		Includes Linux and FreeBSD(?).
a.out		The older Linux.
Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl and des686.pl are the programs to actually generate the assembler.

So to generate elf assembler
perl des586.pl elf >dx86-elf.s
For Windows 95/NT
perl des586.pl win32 >win32.asm

Now the ugly part.  I acquired my copy of Intel's
"Optimizations for Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
xor	eax,	eax				# clear word
mov	al,	cl				# get low byte
xor	edi,	DWORD PTR 0x100+des_SP[eax]	# xor in word
mov	al,	ch				# get next byte
xor	edi,	DWORD PTR 0x300+des_SP[eax]	# xor in word
shr	ecx,	16				# line up the top 2 bytes
which seems ok.  For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.
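
For readers who do not think in assembler, the same byte extraction can be
sketched in C roughly as follows.  The function name and the two 256-entry
tables are hypothetical, and the tables are indexed by element here, whereas
the assembler uses eax as a byte offset into des_SP.

```c
#include <stdint.h>

/* XOR the lookups for the low two bytes of *word into acc, then
 * shift *word so the upper two bytes are ready for the next call.
 * tab_a and tab_b are made-up stand-ins for two des_SP sub-tables. */
uint32_t xor_low_two_bytes(uint32_t acc, uint32_t *word,
                           const uint32_t *tab_a, const uint32_t *tab_b)
{
    uint32_t w = *word;

    acc ^= tab_a[w & 0xffu];          /* mov al, cl ; xor edi, table[eax] */
    acc ^= tab_b[(w >> 8) & 0xffu];   /* mov al, ch ; xor edi, table[eax] */
    *word = w >> 16;                  /* shr ecx, 16 */
    return acc;
}
```

A C compiler is free to turn this into either of the instruction sequences
discussed in this file, which is the whole reason for doing it by hand.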

Now the crunch.  When a full register is used after a partial write, e.g.
mov	al,	cl
xor	edi,	DWORD PTR 0x100+des_SP[eax]
386	- 1 cycle stall
486	- 1 cycle stall
586	- 0 cycle stall
686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium but should be very good on a pentium pro.
mov	eax,	ecx				# copy word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x100+des_SP[eax]	# xor in array lookup
mov	eax,	ecx				# get word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x300+des_SP[eax]	# xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.
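
The 0fch mask and the 0x100/0x300 offsets are easier to follow in C.  The
sketch below assumes des_SP is laid out as 8 contiguous sub-tables of 64
32-bit words (0x100 bytes each), which is my reading of the offsets used
above and not something spelled out here.  The function names and the flat
'sp' array are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assembler view: a byte offset into the whole table.
 * (u & 0xfc) is a multiple of 4, so it always lands on a word boundary. */
uint32_t lookup_by_offset(const uint32_t *sp, uint32_t base_off, uint32_t u)
{
    uint32_t v;
    uint32_t off = base_off + (u & 0xfcu);          /* and eax, 0fch */
    memcpy(&v, (const unsigned char *)sp + off, sizeof v);
    return v;
}

/* C view: pick sub-table 'table', then one of its 64 entries. */
uint32_t lookup_by_index(const uint32_t *sp, unsigned table, uint32_t u)
{
    return sp[table * 64u + ((u >> 2) & 0x3fu)];
}

/* The copy/shift/mask sequence for two bytes, with no partial
 * register writes: every step works on a full 32-bit value. */
uint32_t xor_two_lookups(uint32_t acc, uint32_t *word, const uint32_t *sp)
{
    uint32_t w = *word, t;

    t = w; w >>= 8; t &= 0xfcu;                 /* mov / shr / and */
    acc ^= lookup_by_offset(sp, 0x100u, t);     /* xor edi, 0x100+des_SP[eax] */
    t = w; w >>= 8; t &= 0xfcu;
    acc ^= lookup_by_offset(sp, 0x300u, t);     /* xor edi, 0x300+des_SP[eax] */
    *word = w;
    return acc;
}
```

Under that layout assumption, offset 0x100 selects sub-table 1 and 0x300
selects sub-table 3, and the two lookup functions return the same word for
any input.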

There is a third option.  Instead of using
mov	al,	ch
which is bad on the pentium pro, one may be able to use
movzx	eax,	ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worthwhile.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the Intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c because it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

