Copyright 2011 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



There are 5 generations of 64-bit s390 processors: z900, z990, z9,
z10, and z196.  The current GMP code was optimised for the two oldest,
z900 and z990.


mpn_copyi

This code uses a loop around MVC.  It almost surely runs very close
to optimally.  A small improvement would be to use a single MVC for a
256-byte copy; we currently use two, since an extra MVC is executed
whenever the size is a multiple of 256 bytes.
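
As a rough illustration of the improvement suggested above, here is a
minimal C model of a copy loop built from blocks of at most 256 bytes
(the most one MVC can move).  It is a sketch, not the assembly code:
memcpy stands in for MVC, and the name copyi_chunks is hypothetical.
The tail copy is skipped when nothing remains, so an exact multiple of
256 bytes needs no extra block copy.

```c
#include <stddef.h>
#include <string.h>

enum { MVC_MAX = 256 };  /* one MVC moves between 1 and 256 bytes */

/* Copy nbytes from src to dst, lowest address first, in blocks of at
   most MVC_MAX bytes.  Returns the number of block copies performed;
   each block copy models one MVC (hypothetical model, not GMP code). */
size_t copyi_chunks(unsigned char *dst, const unsigned char *src,
                    size_t nbytes)
{
    size_t chunks = 0;

    /* Full 256-byte blocks, leaving 1..256 bytes for the tail. */
    while (nbytes > MVC_MAX) {
        memcpy(dst, src, MVC_MAX);
        dst += MVC_MAX;
        src += MVC_MAX;
        nbytes -= MVC_MAX;
        chunks++;
    }

    /* Tail of 1..256 bytes; nothing to do for an empty copy. */
    if (nbytes > 0) {
        memcpy(dst, src, nbytes);
        chunks++;
    }
    return chunks;
}
```

With this structure a 256-byte copy takes one block copy and a
512-byte copy takes two.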


mpn_copyd

We have tried several feed-in variants here, branch tree, jump table
and computed goto.  The fastest (on z990) turned out to be computed
goto.

An approach not tried is EX of LMG and STMG, modifying the register set
on-the-fly.  Using that trick, we could completely avoid using
separate feed-in paths.
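
The feed-in idea can be sketched in portable C.  This is only an
illustration, not the assembly code: a switch into a 4-way unrolled
loop (Duff's device) stands in for the computed goto, and the names
limb_t and copyd_sketch are hypothetical.  The entry point into the
unrolled body is selected from n mod 4, so no separate clean-up loop
is needed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* stand-in for GMP's mp_limb_t */

/* Copy n limbs from src to dst, highest limb first, so that
   overlapping operands with dst >= src are handled correctly. */
void copyd_sketch(limb_t *dst, const limb_t *src, size_t n)
{
    size_t iters;

    if (n == 0)
        return;

    src += n;
    dst += n;
    iters = (n + 3) / 4;  /* passes through the unrolled body */

    /* Feed-in: jump into the middle of the unrolled loop.  The
       cases fall through on purpose. */
    switch (n % 4) {
    case 0: do { *--dst = *--src;
    case 3:      *--dst = *--src;
    case 2:      *--dst = *--src;
    case 1:      *--dst = *--src;
            } while (--iters > 0);
    }
}
```

For example, with n = 5 the switch enters at case 1, copies one limb,
and then runs one full pass of four copies.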


mpn_lshift, mpn_rshift

The current code runs at pipeline decode bandwidth on z990.


mpn_add_n, mpn_sub_n

The current code is 4-way unrolled.  It should be unrolled more, at
least 8x, in order to reach 2.5 c/l (cycles per limb).
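
For reference, the shape of a 4-way unrolled addition loop looks
roughly like this in C.  It is a sketch under assumptions, not the
assembly code: the carry is tracked in an ordinary variable, and the
names limb_t, add_limb and add_n_sketch are hypothetical.  Unrolling
8x would mean eight add_limb steps per pass instead of four, halving
the loop overhead per limb.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* stand-in for GMP's mp_limb_t */

/* Return a + b + *carry and update *carry (0 or 1) with the
   carry-out. */
static limb_t add_limb(limb_t a, limb_t b, limb_t *carry)
{
    limb_t s = a + b;
    limb_t c1 = s < a;       /* carry from a + b */
    limb_t r = s + *carry;
    limb_t c2 = r < s;       /* carry from adding the carry-in */

    *carry = c1 | c2;        /* at most one of them can be set */
    return r;
}

/* rp[0..n-1] = ap[0..n-1] + bp[0..n-1]; returns the carry-out. */
limb_t add_n_sketch(limb_t *rp, const limb_t *ap, const limb_t *bp,
                    size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    /* Main loop, 4-way unrolled. */
    for (; i + 4 <= n; i += 4) {
        rp[i + 0] = add_limb(ap[i + 0], bp[i + 0], &cy);
        rp[i + 1] = add_limb(ap[i + 1], bp[i + 1], &cy);
        rp[i + 2] = add_limb(ap[i + 2], bp[i + 2], &cy);
        rp[i + 3] = add_limb(ap[i + 3], bp[i + 3], &cy);
    }

    /* Remaining 0..3 limbs. */
    for (; i < n; i++)
        rp[i] = add_limb(ap[i], bp[i], &cy);

    return cy;
}
```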


mpn_mul_1, mpn_addmul_1, mpn_submul_1

The current code is very naive, but due to the non-pipelined nature of
MLGR on z900 and z990, more sophisticated code would not gain much.

On z10 one would need to cluster at least 4 MLGR together, in order to
reduce stalling.

On z196, one surely wants to use unrolling and pipelining, to perhaps
reach around 12 c/l.  A major issue here and on z10 is ALCGR's 3-cycle
stall.
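
To make the operation concrete, here is a minimal C sketch of a naive
addmul_1 loop: one 64x64->128 multiply per limb, followed by
add-with-carry steps.  It is an illustration, not the assembly code.
The helper umul_64x64 is a portable stand-in for MLGR, and the names
limb_t, umul_64x64 and addmul_1_sketch are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* stand-in for GMP's mp_limb_t */

/* Full 64x64->128 bit unsigned product, built from 32-bit halves.
   This plays the role of a single MLGR. */
static void umul_64x64(limb_t *hi, limb_t *lo, limb_t a, limb_t b)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1;
    uint64_t p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of the three contributions to bits 32..63; fits in 34 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t addmul_1_sketch(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;

    for (size_t i = 0; i < n; i++) {
        limb_t hi, lo, r;

        umul_64x64(&hi, &lo, up[i], v);
        lo += cy;
        hi += lo < cy;       /* carry from adding the incoming carry */
        r = rp[i] + lo;
        hi += r < lo;        /* carry from accumulating into rp[i] */
        rp[i] = r;
        cy = hi;             /* cannot overflow: the sum fits in 128 bits */
    }
    return cy;
}
```

Each iteration depends on the previous one only through cy, which is
the dependency that unrolling and multiply clustering would need to
work around.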


mpn_mul_2, mpn_addmul_2

At least for older machines (z900, z990) with very slow MLGR, we
should use Karatsuba's algorithm on 2-limb units, making mul_2 and
addmul_2 the main multiplication primitives.  The newer machines might
benefit less from this approach, perhaps in particular z10, where MLGR
clustering is more important.

With Karatsuba, one could hope for around 16 cycles per accumulated
128-bit cross product, on z990.
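
The following C sketch shows Karatsuba on a 2-limb unit: a 2x2 limb
product computed with 3 limb multiplications instead of 4.  It is an
illustration under assumptions, not a proposed implementation: it uses
32-bit limbs so that each limb product fits in a uint64_t (the real
code would use 64-bit limbs and MLGR), it uses the subtractive variant
of the middle term, and the name mul_2x2_karatsuba is hypothetical.

```c
#include <stdint.h>

/* (hi:lo) = (a[1]*2^32 + a[0]) * (b[1]*2^32 + b[0]), using
   a1*b0 + a0*b1 = a1*b1 + a0*b0 - (a1 - a0)*(b1 - b0). */
void mul_2x2_karatsuba(uint64_t *hi, uint64_t *lo,
                       const uint32_t a[2], const uint32_t b[2])
{
    uint64_t p0 = (uint64_t)a[0] * b[0];   /* multiplication 1 */
    uint64_t p2 = (uint64_t)a[1] * b[1];   /* multiplication 2 */
    uint32_t da, db;
    int neg = 0;  /* set when (a1 - a0)*(b1 - b0) is negative */

    if (a[1] >= a[0]) da = a[1] - a[0]; else { da = a[0] - a[1]; neg ^= 1; }
    if (b[1] >= b[0]) db = b[1] - b[0]; else { db = b[0] - b[1]; neg ^= 1; }
    uint64_t pm = (uint64_t)da * db;       /* multiplication 3 */

    /* Middle term, up to 65 bits, kept as mid_hi:mid. */
    uint64_t mid = p0 + p2;
    uint64_t mid_hi = mid < p0;
    if (neg) {
        mid += pm;
        mid_hi += mid < pm;
    } else {
        mid_hi -= mid < pm;  /* borrow, tested before subtracting */
        mid -= pm;
    }

    /* result = p0 + middle * 2^32 + p2 * 2^64 */
    uint64_t l = p0 + (mid << 32);
    uint64_t c = l < p0;
    *lo = l;
    *hi = p2 + (mid >> 32) + (mid_hi << 32) + c;
}
```

The extra additions, subtractions and sign handling are the price for
saving one multiplication, which is why the trade-off favours machines
where the multiply is very slow.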