First up, let me say I don't like writing in assembler.  It is not portable,
dependent on the particular CPU architecture release and is generally a pig
to debug and get right.  Having said that, the x86 architecture is probably
the most important for speed due to the number of boxes and since
it appears to be the worst architecture to get
good C compilers for.  So due to this, I have lowered myself to do
assembler for the inner DES routines in libdes :-).

The file to implement in assembler is des_enc.c.  Replace the following
4 functions
des_encrypt1(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt2(DES_LONG data[2],des_key_schedule ks, int encrypt);
des_encrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);
des_decrypt3(DES_LONG data[2],des_key_schedule ks1,ks2,ks3);

They encrypt/decrypt the 64 bits held in 'data' using
the 'ks' key schedules.   The only difference between the 4 functions is that
des_encrypt2() does not perform IP() or FP() on the data (this is an
optimization for when doing triple DES) and des_encrypt3() and des_decrypt3()
perform triple DES.  The triple DES routines are in here because it does
make a big difference to have them located near the des_encrypt2 function
at link time.
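As a rough illustration of how the four entry points relate, here is a
minimal C sketch.  Everything prefixed toy_ is hypothetical and is not libdes
code: toy_ip() and toy_fp() merely swap the two halves (standing in for IP()
and FP()), and toy_encrypt2() is a trivially invertible stand-in for the DES
rounds.  Only the call structure is the point: the triple routines do IP
once, call the IP/FP-less core three times, then do FP once.  The
encrypt/decrypt/encrypt ordering is the usual EDE arrangement and is assumed
here.

```c
#include <stdint.h>

/* Hypothetical key schedule: a single word, not a real des_key_schedule. */
typedef struct { uint32_t k; } toy_ks;

/* Stand-ins for IP() and FP(): swap the halves.  toy_fp undoes toy_ip. */
static void toy_ip(uint32_t d[2]) { uint32_t t = d[0]; d[0] = d[1]; d[1] = t; }
static void toy_fp(uint32_t d[2]) { uint32_t t = d[0]; d[0] = d[1]; d[1] = t; }

/* Stand-in for des_encrypt2(): the core transform with no IP/FP. */
static void toy_encrypt2(uint32_t d[2], const toy_ks *ks, int encrypt)
{
    if (encrypt) { d[0] += ks->k; d[1] ^= ks->k; }
    else         { d[0] -= ks->k; d[1] ^= ks->k; }
}

/* Shape of des_encrypt1(): IP, core, FP. */
static void toy_encrypt1(uint32_t d[2], const toy_ks *ks, int encrypt)
{
    toy_ip(d);
    toy_encrypt2(d, ks, encrypt);
    toy_fp(d);
}

/* Shape of des_encrypt3(): one IP, three core calls (E, D, E), one FP. */
static void toy_encrypt3(uint32_t d[2], const toy_ks *ks1,
                         const toy_ks *ks2, const toy_ks *ks3)
{
    toy_ip(d);
    toy_encrypt2(d, ks1, 1);
    toy_encrypt2(d, ks2, 0);
    toy_encrypt2(d, ks3, 1);
    toy_fp(d);
}

/* Shape of des_decrypt3(): the same steps undone in reverse key order. */
static void toy_decrypt3(uint32_t d[2], const toy_ks *ks1,
                         const toy_ks *ks2, const toy_ks *ks3)
{
    toy_ip(d);
    toy_encrypt2(d, ks3, 0);
    toy_encrypt2(d, ks2, 1);
    toy_encrypt2(d, ks1, 0);
    toy_fp(d);
}
```

Calling toy_encrypt1() three times instead would do three IP/FP pairs, two
of which cancel each other out; skipping them is the saving that
des_encrypt2() is there to provide.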

Now as we all know, there are lots of different operating systems running on
x86 boxes, and unfortunately they normally try to make sure their assembler
formatting is not the same as the other people's.
The 4 main formats I know of are
Microsoft	Windows 95/Windows NT
Elf		Includes Linux and FreeBSD(?).
a.out		The older Linux.
Solaris		Same as Elf but different comments :-(.

Now I was not overly keen to write 4 different copies of the same code,
so I wrote a few perl routines to output the correct assembler, given
a target assembler type.  This code is ugly and is just a hack.
The libraries are x86unix.pl and x86ms.pl.
des586.pl, des686.pl and des-som[23].pl are the programs to actually
generate the assembler.

So to generate elf assembler
perl des-som3.pl elf >dx86-elf.s
For Windows 95/NT
perl des-som2.pl win32 >win32.asm

[ update 4 Jan 1996 ]
I have added another way to do things.
perl des-som3.pl cpp >dx86-cpp.s
generates a file that will be included by dx86unix.cpp when it is compiled.
To build for elf, a.out, solaris, bsdi etc,
cc -E -DELF asm/dx86unix.cpp | as -o asm/dx86-elf.o
cc -E -DSOL asm/dx86unix.cpp | as -o asm/dx86-sol.o
cc -E -DOUT asm/dx86unix.cpp | as -o asm/dx86-out.o
cc -E -DBSDI asm/dx86unix.cpp | as -o asm/dx86bsdi.o
This was done to cut down the number of files in the distribution.

Now the ugly part.  I acquired my copy of Intel's
"Optimizations For Intel's 32-Bit Processors" and found a few interesting
things.  First, the aim of the exercise is to 'extract' one byte at a time
from a word and do an array lookup.  This involves getting the byte from
the 4 locations in the word and moving it to a new word and doing the lookup.
The most obvious way to do this is
xor	eax,	eax				# clear word
mov	al,	cl				# get low byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in word
mov	al,	ch				# get next byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in word
shr	ecx,	16
which seems ok.  For the pentium, this system appears to be the best.
One has to do instruction interleaving to keep both functional units
operating, but it is basically very efficient.
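In C terms the lookup pattern above amounts to roughly the following
sketch.  toy_table, toy_table_init() and lookup_bytes() are hypothetical
names, and the table contents are arbitrary filler rather than real des_SP
data; only the extract-a-byte-and-xor-in-a-table-word pattern is the point.

```c
#include <stdint.h>

/* Hypothetical stand-in for the des_SP tables: one 256-entry table per
 * byte position.  The contents are filler, not real S-box data. */
static uint32_t toy_table[4][256];

/* Fill the tables with arbitrary but deterministic values. */
static void toy_table_init(void)
{
    for (int t = 0; t < 4; t++)
        for (int i = 0; i < 256; i++)
            toy_table[t][i] = ((uint32_t)t * 0x01010101u)
                            ^ ((uint32_t)i * 0x9E3779B1u);
}

/* Pull each of the 4 bytes out of 'word', use it as an index into the
 * table for that byte position, and xor the table words together. */
static uint32_t lookup_bytes(uint32_t word)
{
    uint32_t acc = 0;
    for (int t = 0; t < 4; t++) {
        uint8_t b = (uint8_t)(word >> (8 * t));  /* like mov al, cl / ch */
        acc ^= toy_table[t][b];                  /* like xor edi, table[eax] */
    }
    return acc;
}
```

Call toy_table_init() once before lookup_bytes().  A compiler is free to
turn the byte extraction into either partial-register moves or shifts and
masks, which is exactly the choice the assembler versions make by hand.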

Now the crunch.  When a full register is used after a partial write, eg.
mov	al,	cl
xor	edi,	DWORD PTR 0x100+des_SP[eax]
386	- 1 cycle stall
486	- 1 cycle stall
586	- 0 cycle stall
686	- at least 7 cycle stall (page 22 of the above mentioned document).

So the technique that produces the best results on a pentium, according to
the documentation, will produce hideous results on a pentium pro.

To get around this, des686.pl will generate code that is not as fast on
a pentium, but should be very good on a pentium pro.
mov	eax,	ecx				# copy word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x100+des_SP[eax] 	# xor in array lookup
mov	eax,	ecx				# get word
shr	ecx,	8				# line up next byte
and	eax,	0fch				# mask byte
xor	edi,	DWORD PTR 0x300+des_SP[eax] 	# xor in array lookup

Due to the execution units in the pentium, this actually works quite well.
For a pentium pro it should be very good.  This is the type of output
Visual C++ generates.
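The two ways of forming the lookup index can be compared in C.  This is
only a sketch with hypothetical function names; it assumes, as the 0fch mask
in the listing suggests, that bits 2 to 7 of each byte form the table
offset.  C has no notion of partial registers, so all this checks is that
the copy/shift/mask sequence yields the same index as pulling the byte out
directly.

```c
#include <stdint.h>

/* Index for byte n formed by extracting the byte first and then
 * masking it, in the style of the "mov al, cl" approach. */
static uint32_t index_via_byte(uint32_t word, int n)
{
    uint8_t b = (uint8_t)(word >> (8 * n));
    return (uint32_t)b & 0xfcu;
}

/* Index for byte n formed with full-word operations only, in the
 * style of the des686.pl sequence. */
static uint32_t index_via_shift_mask(uint32_t word, int n)
{
    uint32_t tmp = 0;
    for (int i = 0; i <= n; i++) {
        tmp = word;      /* mov eax, ecx   copy word         */
        word >>= 8;      /* shr ecx, 8     line up next byte */
        tmp &= 0xfcu;    /* and eax, 0fch  mask byte         */
    }
    return tmp;
}
```

For the word 0x12345678 both functions give 0x78, 0x54, 0x34 and 0x10 for
bytes 0 to 3.  The second form costs an extra instruction per byte but never
reads a full register after writing only part of it.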

There is a third option.  Instead of using
mov	al,	ch
which is bad on the pentium pro, one may be able to use
movzx	eax,	ch
which may not incur the partial write penalty.  On the pentium,
this instruction takes 4 cycles so is not worth using but on the
pentium pro it appears it may be worthwhile.  I need access to one to
experiment :-).

eric (20 Oct 1996)

22 Nov 1996 - I have asked people to run the 2 different versions on pentium
pros and it appears that the Intel documentation is wrong.  The
mov al,bh is still faster on a pentium pro, so just use des586.pl
instead of des686.pl

3 Dec 1996 - I added des_encrypt3/des_decrypt3 and have moved these
functions into des_enc.c because it does make a massive performance
difference on some boxes to have the functions' code located close to
the des_encrypt2() function.

9 Jan 1997 - des-som2.pl is now the correct perl script to use for
pentiums.  It contains an inner loop from
Svend Olaf Mikkelsen <svolaf@inet.uni-c.dk> which does raw ecb DES calls at
273,000 per second.  He had a previous version at 250,000 and the best
I was able to get was 203,000.  The content has not changed, this is all
due to instruction sequencing (and actual instruction choice) which is able
to keep both functional units of the pentium going.
We may have lost the ugly register usage restrictions when x86 went 32 bit
but for the pentium it has been replaced by evil instruction ordering tricks.

13 Jan 1997 - des-som3.pl, more optimizations from Svend Olaf.
raw DES at 281,000 per second on a pentium 100.