First up, let me say I don't like writing in assembler. It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right. Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes and since
it appears to be the worst architecture to get
good C compilers for. So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c. Replace the following
4 functions
des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules. The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES) and des_encrypt3() and des_decrypt3()
perform triple DES. The triple DES routines are in here because it does
make a big difference to have them located near the des_encrypt2 function
at link time.

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as other people's.
The 4 main formats I know of are
Microsoft       Windows 95/Windows NT
Elf             Includes Linux and FreeBSD(?).
a.out           The older Linux.
Solaris         Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type. This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part. I acquired my copy of Intel's
"Optimizations for Intel's 32-Bit Processors" and found a few interesting
things.
First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup. This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
xor     eax,    eax                             # clear word
mov     al,     cl                              # get low byte
xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in word
mov     al,     ch                              # get next byte
xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in word
shr     ecx,    16
which seems ok. For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.

Now the crunch. When a full register is used after a partial write, e.g.
mov     al,     cl
xor     edi,    DWORD PTR 0x100+des_SP[eax]
386     - 1 cycle stall
486     - 1 cycle stall
586     - 0 cycle stall
686     - at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium but should be very good on a pentium pro.
mov     eax,    ecx                             # copy word
shr     ecx,    8                               # line up next byte
and     eax,    0fch                            # mask byte
xor     edi,    DWORD PTR 0x100+des_SP[eax]     # xor in array lookup
mov     eax,    ecx                             # get word
shr     ecx,    8                               # line up next byte
and     eax,    0fch                            # mask byte
xor     edi,    DWORD PTR 0x300+des_SP[eax]     # xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good. This is the type of output
Visual C++ generates.

There is a third option. Instead of using
mov     al,     ch
which is bad on the pentium pro, one may be able to use
movzx   eax,    ch
which may not incur the partial write penalty. On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worthwhile. I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the intel documentation is wrong.
The mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl.

3 Dec 1996 - I added des_encrypt3/des_decrypt3 because I have moved these
functions into des_enc.c because it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums. It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second. He had a previous version at 250,000 and the best
I was able to get was 203,000. The content has not changed, this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium it has been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a pentium 100.
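
For anyone who finds the assembler hard to follow, the two lookup sequences
discussed above (byte moves into al versus copy, shift and mask) can be
sketched in C. This is only an illustration under a few assumptions: sp_table
is a made-up stand-in for des_SP filled with arbitrary values, sp_at() and
both function names are hypothetical, and the input word is assumed to have
been masked with 0xfcfcfcfc beforehand so each byte is already a byte offset.
C cannot express a partial register write, so the sketch shows that the two
sequences compute the same value and says nothing about their speed.

```c
#include <stdint.h>

/* Hypothetical stand-in for des_SP: 8 tables of 64 32-bit words, laid out
   flat.  Filled with arbitrary values, NOT the real DES S/P data. */
static uint32_t sp_table[8 * 64];

static void fill_table(void)
{
    uint32_t v = 1;
    for (int i = 0; i < 8 * 64; i++) {
        v = v * 1664525u + 1013904223u;   /* arbitrary filler values */
        sp_table[i] = v;
    }
}

/* Mimic "DWORD PTR off+des_SP[eax]": the argument is a byte offset. */
static uint32_t sp_at(uint32_t byte_off)
{
    return sp_table[byte_off >> 2];
}

/* First sequence: move one byte at a time into a cleared register.
   Assumes w was already masked with 0xfcfcfcfc. */
static uint32_t lookup_byte_moves(uint32_t acc, uint32_t w)
{
    uint32_t eax;
    eax = (uint8_t)w;            /* xor eax, eax ; mov al, cl */
    acc ^= sp_at(0x100 + eax);   /* xor in word */
    eax = (uint8_t)(w >> 8);     /* mov al, ch */
    acc ^= sp_at(0x300 + eax);   /* xor in word */
    return acc;
}

/* Second sequence: copy the word, shift the source, mask the copy. */
static uint32_t lookup_shift_mask(uint32_t acc, uint32_t w)
{
    uint32_t eax;
    eax = w;                     /* copy word */
    w >>= 8;                     /* line up next byte */
    eax &= 0xfc;                 /* mask byte */
    acc ^= sp_at(0x100 + eax);   /* xor in array lookup */
    eax = w;                     /* get word */
    w >>= 8;                     /* line up next byte */
    eax &= 0xfc;                 /* mask byte */
    acc ^= sp_at(0x300 + eax);   /* xor in array lookup */
    return acc;
}
```

Call fill_table() once, then both functions should return the same value
for any word that has been masked with 0xfcfcfcfc.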