Copyright 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library.  If not, see http://www.gnu.org/licenses/.



The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduce speed on Cray systems.  For
systems with cfp floating point, the main obstacle is the forming of
128-bit products.  For IEEE systems, addition, and in particular
computing the carry, is the main issue.  There are no vectorizing
unsigned-less-than instructions, and the sequence that implements that
operation is very long.
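
To illustrate the carry problem: in C, the carry out of a limb addition
is normally written as the unsigned comparison s < a.  Without a vector
instruction for that comparison, the carry has to be synthesized from
other operations.  The sketch below shows one well-known bitwise
formulation.  It is only an illustration, not necessarily the sequence
used by the code in this directory, and the function names are made up
for this example.

```c
#include <stdint.h>

/* Carry out of the 64-bit addition a + b, computed with bitwise
   operations only (no unsigned comparison).  The caller passes
   s = a + b, reduced modulo 2^64.  */
static uint64_t
carry_out_bitwise (uint64_t a, uint64_t b, uint64_t s)
{
  /* A carry leaves bit 63 when both top bits are set, or when at
     least one is set and the top bit of the sum is clear.  */
  return ((a & b) | ((a | b) & ~s)) >> 63;
}

/* The same value, using the comparison that the vector units lack.  */
static uint64_t
carry_out_compare (uint64_t a, uint64_t s)
{
  return s < a;
}
```

Both functions return 0 or 1 and agree for all inputs where s = a + b.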

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be very useful for this.
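
As a reference point, here is a portable C sketch of what a left shift
has to compute per limb.  Each result limb is formed from two adjacent
source limbs, which is presumably the combination the instructions
above deliver directly; that mapping is an assumption, not something
this file spells out.  The function name and the 64-bit limb type are
chosen for the example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift {up, n} left by cnt bits, where n >= 1 and 1 <= cnt <= 63.
   Store the low n limbs at rp and return the bits shifted out of
   the most significant limb.  */
static uint64_t
lshift_sketch (uint64_t *rp, const uint64_t *up, size_t n, unsigned cnt)
{
  unsigned tnc = 64 - cnt;
  uint64_t retval = up[n - 1] >> tnc;
  size_t i;

  /* Every result limb depends on two adjacent source limbs only, so
     there is no carry chain.  Running from the top down keeps the
     scalar version correct when rp == up.  */
  for (i = n - 1; i > 0; i--)
    rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);
  rp[0] = up[0] << cnt;
  return retval;
}
```

Unlike addition, no iteration needs a result from another iteration,
which is why shifting is the easy case.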

For best speed on cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
into chunks of different sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle, hence the peak
cycles/limb figures are half the number of multiplies.)
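
The figures in the tables are consistent with a simple model.  This
model is inferred from the numbers and is not stated in this file:
every U part is multiplied by every V part, two products are
accumulated per cycle, and sums of products must fit in 48 bits
(assumed here to be the cfp mantissa width).  The C sketch below
recomputes a table column under those assumptions; the names
PRODUCT_BITS, split_cost and split_cost_for are hypothetical.

```c
#include <stdint.h>

/* Assumption: partial products are accumulated in 48 bits.  */
#define PRODUCT_BITS 48

struct split_cost
{
  unsigned multiplies;     /* partial products per pair of limbs */
  double peak_cycles;      /* cycles/limb at two products per cycle */
  uint64_t max_vn;         /* products that can be summed safely */
};

/* Cost of splitting each U limb into u_parts pieces of u_bits bits
   and each V limb into v_parts pieces of v_bits bits.  Requires
   u_bits + v_bits <= PRODUCT_BITS.  */
static struct split_cost
split_cost_for (unsigned u_parts, unsigned u_bits,
                unsigned v_parts, unsigned v_bits)
{
  struct split_cost c;

  c.multiplies = u_parts * v_parts;
  c.peak_cycles = c.multiplies / 2.0;
  /* One product needs u_bits + v_bits bits; each further doubling of
     the number of accumulated products needs one more bit.  */
  c.max_vn = UINT64_C (1) << (PRODUCT_BITS - u_bits - v_bits);
  return c;
}
```

For example, split_cost_for (3, 22, 5, 13) gives 15 multiplies, 7.5
cycles/limb and a maximum vn of 8192, matching the last column of the
second table.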
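
IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1 (vectorizable): limb sums, carry-outs saved in cy[].  */
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          rp[i] = s;
          cy[i + 1] = s < up[i]; }
      more_carries = 0;
    /* Pass 2 (vectorizable): add each saved carry into the next limb.  */
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
      cys = 0;
    /* Pass 3 (scalar, rarely needed): ripple the carries from pass 2.
       c2 is the carry limb i itself generated in pass 2; it must be
       passed on even when no carry arrives from below.  */
      if (more_carries)
        {
          for (i = 1; i < n; i++)
            { c2 = rp[i] < cy[i];
              rp[i] += cys;
              cys = c2 | (rp[i] < cys); }
        }
      return cys + cy[n];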

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.
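
A portable, self-contained C sketch of the two ideas above follows, for
experimenting on other machines.  It assumes 64-bit limbs and n >= 1,
and the names propagate, add_n_3pass and add3_n_3pass are hypothetical.
The ripple pass here passes on the carry that each limb generated in
the second pass as well as the carry arriving from below.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Passes 2 and 3: add cy[i] into rp[i] for 1 <= i < n, then ripple
   any carries that this generated.  Returns the carry out of limb
   n - 1, not counting cy[n].  */
static uint64_t
propagate (uint64_t *rp, const unsigned char *cy, size_t n)
{
  uint64_t more_carries = 0, cys = 0;
  size_t i;

  for (i = 1; i < n; i++)       /* independent iterations */
    {
      rp[i] += cy[i];
      more_carries += rp[i] < cy[i];
    }
  if (more_carries)             /* rare scalar fix-up */
    for (i = 1; i < n; i++)
      {
        uint64_t c2 = rp[i] < cy[i];    /* carry made in the loop above */
        rp[i] += cys;
        cys = c2 | (rp[i] < cys);
      }
  return cys;
}

/* rp[] = up[] + vp[] over n limbs; returns the carry out (0 or 1).  */
static uint64_t
add_n_3pass (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  unsigned char *cy = malloc (n + 1);
  uint64_t ret;
  size_t i;

  if (cy == NULL)
    abort ();
  cy[0] = 0;
  for (i = 0; i < n; i++)       /* pass 1: independent iterations */
    {
      uint64_t u = up[i], s = u + vp[i];
      rp[i] = s;
      cy[i + 1] = s < u;
    }
  ret = propagate (rp, cy, n) + cy[n];
  free (cy);
  return ret;
}

/* rp[] = up[] + vp[] + wp[] over n limbs; returns the carry out (0..2).  */
static uint64_t
add3_n_3pass (uint64_t *rp, const uint64_t *up, const uint64_t *vp,
              const uint64_t *wp, size_t n)
{
  unsigned char *cy = malloc (n + 1);
  uint64_t ret;
  size_t i;

  if (cy == NULL)
    abort ();
  cy[0] = 0;
  for (i = 0; i < n; i++)       /* add operands 1 and 2 */
    {
      uint64_t u = up[i], s = u + vp[i];
      rp[i] = s;
      cy[i + 1] = s < u;
    }
  for (i = 0; i < n; i++)       /* add operand 3, accumulate carries */
    {
      uint64_t w = wp[i], s = rp[i] + w;
      rp[i] = s;
      cy[i + 1] += s < w;
    }
  ret = propagate (rp, cy, n) + cy[n];
  free (cy);
  return ret;
}
```

The first loops of each function have no dependency between
iterations, which is the property the ivdep pragma asserts in the
outline above.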

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write an mpn_mul_basecase that would run at effectively 1 cycle/limb.
(Use VM (vector mask) here to better handle the rhomb-shaped multiply
area, perhaps rounding operand sizes up to the next power of 2.)
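
A minimal sketch of why spare bits help, assuming 62 significant bits
stored in each 64-bit word: the limb sum can no longer wrap, so the
carry is obtained with a shift instead of an unsigned comparison.  The
exact representation is not specified above; LIMB_BITS, LIMB_MASK and
add_n_62 are names chosen for this illustration.  The loop is written
serially for clarity and says nothing about the cycle counts quoted.

```c
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 62
#define LIMB_MASK ((UINT64_C (1) << LIMB_BITS) - 1)

/* Add {up, n} and {vp, n}, where every word holds a value below
   2^LIMB_BITS.  Store the n result limbs at rp in the same format
   and return the carry out (0 or 1).  */
static uint64_t
add_n_62 (uint64_t *rp, const uint64_t *up, const uint64_t *vp, size_t n)
{
  uint64_t cy = 0;
  size_t i;

  for (i = 0; i < n; i++)
    {
      /* At most (2^62 - 1) + (2^62 - 1) + 1 < 2^63: cannot wrap.  */
      uint64_t s = up[i] + vp[i] + cy;
      rp[i] = s & LIMB_MASK;
      cy = s >> LIMB_BITS;      /* carry by shift, no comparison */
    }
  return cy;
}
```

The cost is that operands need more words for the same number of bits,
and values must be converted when exchanged with code that expects
full 64-bit limbs.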
111