Name            Date          Size      #Lines  LOC
cfp/            09-Jul-2024   -         655     542
ieee/           09-Jul-2024   -         744     421
README          09-Jul-2024   3.8 KiB   122     95
add_n.c         09-Jul-2024   2.5 KiB   91      45
gmp-mparam.h    09-Jul-2024   2.7 KiB   75      32
hamdist.c       09-Jul-2024   1.3 KiB   43      11
lshift.c        09-Jul-2024   1.6 KiB   59      23
mulww.f         09-Jul-2024   2.2 KiB   64      24
popcount.c      09-Jul-2024   1.2 KiB   43      11
rshift.c        09-Jul-2024   1.6 KiB   59      23
sub_n.c         09-Jul-2024   2.6 KiB   91      45

README

Copyright 2000-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduce speed on Cray systems.  For
systems with cfp floating point, the main obstacle is the forming of
128-bit products.  For IEEE systems, adding, and in particular
computing the carry, is the main issue.  There are no vectorizing
unsigned-less-than instructions, and the sequence that implements that
operation is very long.
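
To illustrate what computing the carry involves, here is a minimal
sketch in portable C, assuming 64-bit limbs.  It shows the carry out of
a limb addition obtained once with an unsigned less-than comparison and
once with logical operations and a single shift.  The helper names
carry_by_compare and carry_by_bits are hypothetical, and nothing here
is claimed to be the sequence used in this directory.

```c
#include <stdint.h>

/* Carry out of a + b, detected with an unsigned less-than comparison. */
static inline uint64_t carry_by_compare(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;              /* wraps modulo 2^64 */
    return s < a;
}

/* The same carry from logical operations and one shift.  The top bit
   position carries out if both top bits are set, or if at least one is
   set and the top bit of the sum is clear (a carry came in and was
   passed on). */
static inline uint64_t carry_by_bits(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return ((a & b) | ((a | b) & ~s)) >> 63;
}
```

Both helpers return 0 or 1; the second form uses only operations that
are applied elementwise, which is why it is a candidate for vector code.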

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.
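
Shifting is easy to make fast because each result limb depends on only
two adjacent source limbs, so there is no carry chain.  A minimal
sketch in portable C, assuming 64-bit limbs; the name lshift_sketch is
hypothetical, and this loop is not the code in lshift.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift {up, n} left by cnt bits into {rp, n} and return the bits
   shifted out of the top limb.  Requires n >= 1 and 1 <= cnt <= 63. */
uint64_t lshift_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                       unsigned cnt)
{
    assert(n >= 1 && cnt >= 1 && cnt <= 63);
    unsigned tnc = 64 - cnt;
    uint64_t out = up[n - 1] >> tnc;

    /* Each iteration reads up[i] and up[i - 1] and writes only rp[i];
       with non-overlapping operands the iterations are independent. */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> tnc);
    rp[0] = up[0] << cnt;
    return out;
}
```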

For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
in different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)
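
The rows of the tables are consistent with a simple headroom
calculation: a product of a u-bit part and a v-bit part occupies u + v
bits, and an accumulator with m significant bits can sum 2^(m - u - v)
such products exactly.  The sketch below assumes m = 48, the
coefficient width of the traditional Cray floating-point format; that
figure is an assumption that is not stated in the text above, and all
function names are hypothetical.

```c
#include <stdint.h>

#define LIMB_BITS          64
#define ACC_BITS           48   /* assumed exact accumulator width */
#define PRODUCTS_PER_CYCLE 2    /* from the T90 note above */

/* Bits per part when a limb is cut into `parts` pieces, rounded up. */
unsigned bits_per_part(unsigned parts)
{
    return (LIMB_BITS + parts - 1) / parts;
}

/* How many ubits-by-vbits products fit in the accumulator headroom.
   Requires ubits + vbits <= ACC_BITS. */
uint64_t max_products(unsigned ubits, unsigned vbits)
{
    return UINT64_C(1) << (ACC_BITS - ubits - vbits);
}

/* Every part of U meets every part of V once per limb. */
unsigned multiplies_per_limb(unsigned uparts, unsigned vparts)
{
    return uparts * vparts;
}

/* Peak rate when PRODUCTS_PER_CYCLE products are accumulated per cycle. */
double peak_cycles_per_limb(unsigned uparts, unsigned vparts)
{
    return (double) multiplies_per_limb(uparts, vparts) / PRODUCTS_PER_CYCLE;
}
```

For example, with U in 3 parts of 22 bits and V in 4 parts of 16 bits,
max_products(22, 16) gives 1024 and peak_cycles_per_limb(3, 4) gives 6,
matching the second table.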

IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1: add limbwise; cy[i + 1] is the carry out of limb i.  */
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          rp[i] = s;
          cy[i + 1] = s < up[i]; }
      /* Pass 2: add each carry into the next limb, counting new carries.  */
      more_carries = 0;
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
      /* Pass 3 (rare): ripple the carries that pass 2 generated.  */
      cys = 0;
      if (more_carries)
        {
          cys = rp[1] < cy[1];
          for (i = 2; i < n; i++)
            { c2 = rp[i] < cy[i];       /* pass 2 carried out of limb i */
              rp[i] += cys;
              cys = c2 | (rp[i] < cys); }
        }
      return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.
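
The three-pass mpn_add_n idea can be sketched in portable C as follows,
assuming 64-bit limbs.  The name add_n_sketch is hypothetical, the
Cray pragmas are left out, and the scratch array comes from calloc
purely to keep the example self-contained.  One choice made here: pass
2 overwrites cy[i] with the carry it produces, so that pass 3 sees a
carry generated at any limb, not only at limb 1.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sets {rp, n} = {up, n} + {vp, n} and returns the carry out (0 or 1).
   Requires n >= 1.  Aborts if the scratch array cannot be allocated. */
uint64_t add_n_sketch(uint64_t *rp, const uint64_t *up,
                      const uint64_t *vp, size_t n)
{
    unsigned char *cy = calloc(n + 1, 1);
    if (cy == NULL)
        abort();

    /* Pass 1: independent limb additions; cy[i + 1] is the carry out
       of limb i. */
    for (size_t i = 0; i < n; i++) {
        uint64_t a = up[i];
        uint64_t s = a + vp[i];
        rp[i] = s;
        cy[i + 1] = s < a;
    }

    /* Pass 2: add the pass-1 carry into each limb, again independently.
       cy[i] is overwritten with the carry this addition produces. */
    unsigned more_carries = 0;
    for (size_t i = 1; i < n; i++) {
        uint64_t s = rp[i] + cy[i];
        unsigned char c2 = s < cy[i];
        rp[i] = s;
        cy[i] = c2;
        more_carries |= c2;
    }

    /* Pass 3 (rare): sequentially ripple the carries from pass 2. */
    uint64_t cys = 0;
    if (more_carries) {
        for (size_t i = 1; i < n; i++) {
            uint64_t t = rp[i] + cys;
            cys = cy[i] | (t < cys);    /* right side uses the old cys */
            rp[i] = t;
        }
    }

    uint64_t carry_out = cys + cy[n];   /* at most one of the two is set */
    free(cy);
    return carry_out;
}
```

Passes 1 and 2 have no dependency between iterations; only pass 3 is
sequential, and it runs only when pass 2 produced a carry.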

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
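
To illustrate why spare bits help: when each limb holds only 62
significant bits in a 64-bit word, the sum of two limbs plus a carry
always fits in the word, so the carry is read with a shift and the
result with a mask, and no unsigned comparison is needed.  A minimal
sketch assuming 64-bit words; NAIL_LIMB_BITS, NAIL_LIMB_MASK and
add_n_62 are hypothetical names, and this scalar loop shows only the
representation, not a vectorized schedule or the cycle counts quoted
above.

```c
#include <stddef.h>
#include <stdint.h>

#define NAIL_LIMB_BITS 62
#define NAIL_LIMB_MASK ((UINT64_C(1) << NAIL_LIMB_BITS) - 1)

/* Adds two n-limb numbers whose limbs each hold 62 bits (the top two
   bits of every input word must be zero).  Returns the carry out. */
uint64_t add_n_62(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                  size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* At most 2 * (2^62 - 1) + 1 = 2^63 - 1, so this cannot wrap. */
        uint64_t s = up[i] + vp[i] + cy;
        rp[i] = s & NAIL_LIMB_MASK;     /* low 62 bits are the result */
        cy = s >> NAIL_LIMB_BITS;       /* bit 62 is the carry */
    }
    return cy;
}
```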