Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.



There are five generations of 64-bit s390 processors: z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code uses a loop around MVC.  It almost surely runs very close to
optimally.  A small improvement would be to use a single MVC when the
size is exactly 256 bytes; currently we use two (an extra MVC is
executed whenever the size is a multiple of 256 bytes).
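
As an illustration of the improvement suggested above, here is a minimal
C sketch (not the GMP assembly) of a copy split into pieces of at most
256 bytes, the most a single MVC can move.  Each memcpy call stands in
for one MVC.  The name copyi_chunked, the limb_t typedef, and the piece
counter are hypothetical, and the sketch assumes non-overlapping
operands.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t limb_t;   /* stand-in for mp_limb_t on a 64-bit target */

/* Copy n limbs from src to dst in pieces of at most 256 bytes.
   Returns the number of pieces, i.e. the modelled MVC executions. */
size_t copyi_chunked(limb_t *dst, const limb_t *src, size_t n)
{
    size_t bytes = n * sizeof(limb_t);
    size_t pieces = 0;
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (bytes == 0)
        return 0;

    /* Number of full 256-byte pieces, chosen so that the tail always
       holds between 1 and 256 bytes.  An exact multiple of 256 then
       needs no extra piece. */
    size_t full = (bytes - 1) / 256;
    for (size_t i = 0; i < full; i++) {
        memcpy(d, s, 256);
        d += 256;
        s += 256;
        pieces++;
    }

    size_t tail = bytes - full * 256;
    assert(tail >= 1 && tail <= 256);
    memcpy(d, s, tail);
    pieces++;

    return pieces;
}
```

With this split, 32 limbs (256 bytes) take one piece and 64 limbs
(512 bytes) take two.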


mpn_copyd

We have tried several feed-in variants here: branch tree, jump table,
and computed goto.  The fastest (on z990) turned out to be computed
goto.

An approach not yet tried is EX of LMG and STMG, modifying the register
set on the fly.  With that trick, we could avoid separate feed-in paths
completely.
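
To make "feed-in" concrete, here is a minimal C sketch of a decreasing
copy with an unrolled main loop.  The switch disposes of the n mod 4
leftover limbs before the loop starts; a compiler would typically turn
it into a jump table or a branch tree.  The computed-goto variant that
was fastest on z990 is not shown, since standard C has no such
construct.  The 4-way unrolling and all names are assumptions made for
the illustration, not taken from the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* stand-in for mp_limb_t on a 64-bit target */

/* Copy n limbs from src to dst, highest limb first, so that
   overlapping operands with dst >= src are handled correctly. */
void copyd_feedin(limb_t *dst, const limb_t *src, size_t n)
{
    size_t i = n;   /* limbs still to copy; the next index is i - 1 */

    /* Feed-in: copy the n mod 4 leftover limbs. */
    switch (n % 4) {
    case 3: i--; dst[i] = src[i];   /* fall through */
    case 2: i--; dst[i] = src[i];   /* fall through */
    case 1: i--; dst[i] = src[i];   /* fall through */
    case 0: break;
    }
    assert(i % 4 == 0);

    /* Main loop: four limbs per iteration. */
    while (i > 0) {
        dst[i - 1] = src[i - 1];
        dst[i - 2] = src[i - 2];
        dst[i - 3] = src[i - 3];
        dst[i - 4] = src[i - 4];
        i -= 4;
    }
}
```

The EX approach mentioned above would replace the switch entirely, by
letting a single executed instruction handle a variable number of
registers.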


mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwidth on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled more, at
least 8-way, in order to reach 2.5 c/l (cycles per limb).
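
For reference, here is a C model of a 4-way unrolled addition with carry
propagation.  It is only a sketch of the loop structure: the carry is
recomputed with comparisons here, whereas assembly code would
presumably keep it in the condition code and use add-with-carry
instructions.  The names add_limb and add_n_unrolled4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* stand-in for mp_limb_t on a 64-bit target */

/* Store a + b + cy in *r and return the carry out (0 or 1). */
static inline limb_t add_limb(limb_t *r, limb_t a, limb_t b, limb_t cy)
{
    limb_t s = a + b;
    limb_t c1 = s < a;      /* carry from a + b */
    limb_t t = s + cy;
    limb_t c2 = t < s;      /* carry from adding the carry in */
    *r = t;
    return c1 | c2;         /* at most one of the two can be set */
}

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1]; returns the final carry. */
limb_t add_n_unrolled4(limb_t *rp, const limb_t *ap, const limb_t *bp,
                       size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    assert(n >= 1);

    /* Main loop: four limbs per iteration. */
    for (; i + 4 <= n; i += 4) {
        cy = add_limb(&rp[i],     ap[i],     bp[i],     cy);
        cy = add_limb(&rp[i + 1], ap[i + 1], bp[i + 1], cy);
        cy = add_limb(&rp[i + 2], ap[i + 2], bp[i + 2], cy);
        cy = add_limb(&rp[i + 3], ap[i + 3], bp[i + 3], cy);
    }

    /* Wind-down: the remaining n mod 4 limbs. */
    for (; i < n; i++)
        cy = add_limb(&rp[i], ap[i], bp[i], cy);

    return cy;
}
```

Unrolling 8-way amounts to doubling the body of the main loop, which
spreads the loop overhead over twice as many limbs.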


mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10, one would need to cluster at least 4 MLGR together in order to
reduce stalling.

On z196, one surely wants to use unrolling and pipelining, to perhaps
reach around 12 c/l.  A major issue here and on z10 is ALCGR's 3-cycle
stall.
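
As a point of reference for what "naive" means here, the following C
sketch shows a straightforward addmul_1 loop: one 64x64->128-bit
multiply per limb (the job MLGR does), followed by the carry-propagating
additions.  It relies on unsigned __int128, a GCC/Clang extension
assumed to be available, and the name addmul_1_naive is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* stand-in for mp_limb_t on a 64-bit target */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t addmul_1_naive(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;

    assert(n >= 1);

    for (size_t i = 0; i < n; i++) {
        /* up[i] * v + rp[i] + cy is at most 2^128 - 1, so the sum
           cannot overflow the 128-bit accumulator. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;           /* low limb */
        cy = (limb_t)(t >> 64);      /* high limb feeds the next step */
    }
    return cy;
}
```

Each iteration depends on the carry from the previous one, which is why
clustering the multiplies and software pipelining matter on machines
where the multiply is pipelined.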


mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128-bit cross product on z990.
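
The saving comes from forming a 2-limb by 2-limb product with three
64x64-bit multiplies instead of four.  Here is a minimal C sketch of one
such product, using the subtractive form of Karatsuba's identity:

  a0*b1 + a1*b0 = a0*b0 + a1*b1 - (a1 - a0)*(b1 - b0)

The choice of the subtractive form, the name mul_2x2_karatsuba, and the
use of unsigned __int128 (a GCC/Clang extension) are assumptions made
for the illustration; this is not code from GMP.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;          /* stand-in for mp_limb_t */
typedef unsigned __int128 u128;   /* compiler extension, assumed present */

/* r[0..3] = (a[1]*2^64 + a[0]) * (b[1]*2^64 + b[0]),
   using three 64x64->128-bit multiplies. */
void mul_2x2_karatsuba(limb_t r[4], const limb_t a[2], const limb_t b[2])
{
    u128 lo = (u128)a[0] * b[0];
    u128 hi = (u128)a[1] * b[1];

    /* |a1 - a0| and |b1 - b0|, with the signs kept separately. */
    int neg_a = a[1] < a[0];
    int neg_b = b[1] < b[0];
    limb_t da = neg_a ? a[0] - a[1] : a[1] - a[0];
    limb_t db = neg_b ? b[0] - b[1] : b[1] - b[0];
    u128 dd = (u128)da * db;

    /* mid = lo + hi -/+ dd = a0*b1 + a1*b0, a value of up to 129 bits
       held as (mid_top, mid). */
    u128 mid = lo + hi;
    limb_t mid_top = mid < lo;        /* carry out of bit 127 */
    if (neg_a == neg_b) {             /* (a1-a0)*(b1-b0) >= 0: subtract */
        mid_top -= mid < dd;          /* borrow, taken before updating */
        mid -= dd;
    } else {                          /* product is negative: add */
        mid += dd;
        mid_top += mid < dd;          /* carry */
    }
    assert(mid_top <= 1);

    /* Assemble lo + mid * 2^64 + hi * 2^128. */
    r[0] = (limb_t)lo;
    u128 t = (u128)(limb_t)(lo >> 64) + (limb_t)mid;
    r[1] = (limb_t)t;
    t = (t >> 64) + (limb_t)(mid >> 64) + (limb_t)hi;
    r[2] = (limb_t)t;
    r[3] = (limb_t)(t >> 64) + (limb_t)(hi >> 64) + mid_top;
}
```

The price of the saved multiply is the extra additions, subtractions,
and sign handling, so the trade only pays off where the multiply is
much slower than the additions.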