Searched defs:C (Results 1 – 25 of 2077) sorted by relevance

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k7/mmx/
divrem_1.asm
35 C K7: 17.0 cycles/limb integer part, 15.0 cycles/limb fraction part.
38 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
39 C mp_srcptr src, mp_size_t size,
40 C mp_limb_t divisor);
41 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
42 C mp_srcptr src, mp_size_t size,
43 C mp_limb_t divisor, mp_limb_t carry);
44 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
45 C mp_srcptr src, mp_size_t size,
46 C mp_limb_t divisor, mp_limb_t inverse,
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/pa64/
submul_1.asm
34 C cycles/limb
35 C 8000,8200: 7
36 C 8500,8600,8700: 6.5
38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
39 C could be saved there per call.
41 C DESCRIPTION:
42 C The main loop "BIG" is 4-way unrolled, mainly to allow
43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
44 C registers to the IU registers, have demanded a deep software pipeline, and
45 C a lot of stack slots for partial products in flight.
[all …]
addmul_1.asm
34 C cycles/limb
35 C 8000,8200: 7
36 C 8500,8600,8700: 6.375
38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
39 C could be saved there per call.
41 C DESCRIPTION:
42 C The main loop "BIG" is 4-way unrolled, mainly to allow
43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
44 C registers to the IU registers, have demanded a deep software pipeline, and
45 C a lot of stack slots for partial products in flight.
[all …]
mul_1.asm
34 C cycles/limb
35 C 8000,8200: 6.5
36 C 8500,8600,8700: 5.625
38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
39 C could be saved there per call.
41 C DESCRIPTION:
42 C The main loop "BIG" is 4-way unrolled, mainly to allow
43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
44 C registers to the IU registers, have demanded a deep software pipeline, and
45 C a lot of stack slots for partial products in flight.
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/ia64/
mode1o.asm
36 C cycles/limb
37 C Itanium: 15
38 C Itanium 2: 8
50 C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size,
51 C mp_limb_t divisor, mp_limb_t carry);
52 C
53 C The modexact algorithm is usually conceived as a dependent chain
54 C
55 C l = src[i] - c
56 C q = low(l * inverse)
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k6/
aorsmul_1.asm
34 C cycles/limb
35 C P5
36 C P6 model 0-8,10-12 5.94
37 C P6 model 9 (Banias) 5.51
38 C P6 model 13 (Dothan) 5.57
39 C P4 model 0 (Willamette)
40 C P4 model 1 (?)
41 C P4 model 2 (Northwood)
42 C P4 model 3 (Prescott)
43 C P4 model 4 (Nocona)
[all …]
mul_basecase.asm
34 C K6: approx 9.0 cycles per cross product on 30x30 limbs (with 16 limbs/loop
35 C unrolling).
51 C void mpn_mul_basecase (mp_ptr wp,
52 C mp_srcptr xp, mp_size_t xsize,
53 C mp_srcptr yp, mp_size_t ysize);
54 C
55 C Calculate xp,xsize multiplied by yp,ysize, storing the result in
56 C wp,xsize+ysize.
57 C
58 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
[all …]
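
The contract quoted above (multiply xp,xsize by yp,ysize and store xsize+ysize limbs at wp) can be illustrated with a plain schoolbook loop. This is only a sketch under stated assumptions, not the K6 routine or GMP's generic code: `limb_t` and `mul_basecase_sketch` are hypothetical names, limbs are assumed to be 32 bits wide and stored least significant first, and wp is assumed not to overlap the inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; assumes 32-bit limbs. */
typedef uint32_t limb_t;

/* Schoolbook multiply: wp[0 .. xsize+ysize-1] = {xp,xsize} * {yp,ysize}.
   Limbs are least significant first. wp must not overlap xp or yp. */
void mul_basecase_sketch(limb_t *wp,
                         const limb_t *xp, size_t xsize,
                         const limb_t *yp, size_t ysize)
{
    assert(xsize >= 1 && ysize >= 1);

    /* Clear the destination so each row can be accumulated into it. */
    for (size_t i = 0; i < xsize + ysize; i++)
        wp[i] = 0;

    /* One row per limb of y: the same work as a mul_1 followed by
       repeated addmul_1 calls, written here as a single nested loop. */
    for (size_t j = 0; j < ysize; j++) {
        uint64_t carry = 0;
        for (size_t i = 0; i < xsize; i++) {
            /* Cannot overflow: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
            uint64_t t = (uint64_t)xp[i] * yp[j] + wp[i + j] + carry;
            wp[i + j] = (limb_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;         /* high 32 bits move up one limb */
        }
        wp[j + xsize] = (limb_t)carry;  /* this limb was still zero */
    }
}
```

Each (i, j) iteration is one "cross product" in the sense of the cycle counts quoted in these headers, so a 30x30 multiply performs 900 of them.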
sqr_basecase.asm
34 C K6: approx 4.7 cycles per cross product, or 9.2 cycles per triangular
35 C product (measured on the speed difference between 17 and 33 limbs,
36 C which is roughly the Karatsuba recursing range).
70 C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size);
71 C
72 C The algorithm is essentially the same as mpn/generic/sqr_basecase.c, but a
73 C lot of function call overheads are avoided, especially when the given size
74 C is small.
75 C
76 C The code size might look a bit excessive, but not all of it is executed
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/sparc64/ultrasparc1234/
addmul_2.asm
34 C cycles/limb
35 C UltraSPARC 1&2: 9
36 C UltraSPARC 3: 10
38 C Algorithm: We use 16 floating-point multiplies per limb product, with the
39 C 2-limb v operand split into eight 16-bit pieces, and the n-limb u operand
40 C split into 32-bit pieces. We sum four 48-bit partial products using
41 C floating-point add, then convert the resulting four 50-bit quantities and
42 C transfer them to the integer unit.
44 C Possible optimizations:
45 C 1. Align the stack area where we transfer the four 50-bit product-sums
[all …]
addmul_1.asm
34 C cycles/limb
35 C UltraSPARC 1&2: 14
36 C UltraSPARC 3: 17.5
38 C Algorithm: We use eight floating-point multiplies per limb product, with the
39 C invariant v operand split into four 16-bit pieces, and the up operand split
40 C into 32-bit pieces. We sum pairs of 48-bit partial products using
41 C floating-point add, then convert the four 49-bit product-sums and transfer
42 C them to the integer unit.
44 C Possible optimizations:
45 C 0. Rewrite to use algorithm of mpn_addmul_2.
[all …]
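
The operand split described above works because a 32-bit piece times a 16-bit piece is below 2^48, and the sum of two such products is below 2^49, so both are exact in a double (53-bit significand). The following is a rough sketch of that idea for a single 64x64-bit limb product, not the UltraSPARC code: the names `mul_64x64_fp_sketch` and `add_shifted` are hypothetical, and because it handles only one limb, just two of the six weights receive a pair of products. How the real loop pairs products to reach four product-sums per limb is not visible in the excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Add (t << shift) into the 128-bit value hi:lo. Assumes the shifted
   value fits in 128 bits and, for shift >= 64, that t << (shift - 64)
   fits in 64 bits (true for the 48-bit values passed in below). */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t t, unsigned shift)
{
    uint64_t add_lo, add_hi;
    if (shift == 0) {
        add_lo = t;
        add_hi = 0;
    } else if (shift < 64) {
        add_lo = t << shift;
        add_hi = t >> (64 - shift);
    } else {
        add_lo = 0;
        add_hi = t << (shift - 64);
    }
    uint64_t old = *lo;
    *lo += add_lo;
    *hi += add_hi + (*lo < old);   /* carry out of the low word */
}

/* hi:lo = u * v, computed with eight floating-point multiplies. */
void mul_64x64_fp_sketch(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    /* u as two 32-bit pieces, v as four 16-bit pieces, all exact doubles. */
    double up[2] = { (double)(u & 0xFFFFFFFFu), (double)(u >> 32) };
    double vp[4] = { (double)(v & 0xFFFFu), (double)((v >> 16) & 0xFFFFu),
                     (double)((v >> 32) & 0xFFFFu), (double)(v >> 48) };

    /* s[k] collects the products of weight 2^(16k), where the weight of
       up[i] * vp[j] is 2^(32i + 16j). Each product is below 2^48, and at
       most two products share a weight, so every s[k] is below 2^49. */
    double s[6] = { 0, 0, 0, 0, 0, 0 };
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 4; j++)
            s[2 * i + j] += up[i] * vp[j];

    /* Convert each exact sum back to an integer and accumulate it. */
    *hi = 0;
    *lo = 0;
    for (unsigned k = 0; k < 6; k++)
        add_shifted(hi, lo, (uint64_t)s[k], 16 * k);
}
```

The low word can be checked against ordinary wrapping multiplication: `lo` should equal `u * v` computed in `uint64_t`.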
mul_1.asm
34 C cycles/limb
35 C UltraSPARC 1&2: 14
36 C UltraSPARC 3: 18.5
38 C Algorithm: We use eight floating-point multiplies per limb product, with the
39 C invariant v operand split into four 16-bit pieces, and the s1 operand split
40 C into 32-bit pieces. We sum pairs of 48-bit partial products using
41 C floating-point add, then convert the four 49-bit product-sums and transfer
42 C them to the integer unit.
44 C Possible optimizations:
45 C 1. Align the stack area where we transfer the four 49-bit product-sums
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/pentium4/sse2/
divrem_1.asm
34 C P4: 32 cycles/limb integer part, 30 cycles/limb fraction part.
37 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
38 C mp_srcptr src, mp_size_t size,
39 C mp_limb_t divisor);
40 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
41 C mp_srcptr src, mp_size_t size,
42 C mp_limb_t divisor, mp_limb_t carry);
43 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
44 C mp_srcptr src, mp_size_t size,
45 C mp_limb_t divisor, mp_limb_t inverse,
[all …]
popcount.asm
35 C 32-bit popcount hamdist
36 C cycles/limb cycles/limb
37 C P5 -
38 C P6 model 0-8,10-12 -
39 C P6 model 9 (Banias) ?
40 C P6 model 13 (Dothan) 4
41 C P4 model 0 (Willamette) ?
42 C P4 model 1 (?) ?
43 C P4 model 2 (Northwood) 3.9
44 C P4 model 3 (Prescott) ?
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86_64/core2/
sqr_basecase.asm
36 C cycles/limb mul_2 addmul_2 sqr_diag_addlsh1
37 C AMD K8,K9
38 C AMD K10
39 C AMD bull
40 C AMD pile
41 C AMD steam
42 C AMD bobcat
43 C AMD jaguar
44 C Intel P4
45 C Intel core 4.9 4.18-4.25 3.87
[all …]
mul_basecase.asm
36 C cycles/limb mul_1 mul_2 mul_3 addmul_2
37 C AMD K8,K9
38 C AMD K10
39 C AMD bull
40 C AMD pile
41 C AMD steam
42 C AMD bobcat
43 C AMD jaguar
44 C Intel P4
45 C Intel core 4.0 4.0 - 4.18-4.25
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/sparc32/v9/
sqr_diagonal.asm
34 C INPUT PARAMETERS
35 C rp i0
36 C up i1
37 C n i2
39 C This code uses a very deep software pipeline, due to the need for moving data
40 C forth and back between the integer registers and floating-point registers.
41 C
42 C A VIS variant of this code would make the pipeline less deep, since the
43 C masking now done in the integer unit could take place in the floating-point
44 C unit using the FAND instruction. It would be possible to save several cycles
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/p6/mmx/
divrem_1.asm
34 C P6MMX: 25.0 cycles/limb integer part, 17.5 cycles/limb fraction part.
37 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
38 C mp_srcptr src, mp_size_t size,
39 C mp_limb_t divisor);
40 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
41 C mp_srcptr src, mp_size_t size,
42 C mp_limb_t divisor, mp_limb_t carry);
43 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
44 C mp_srcptr src, mp_size_t size,
45 C mp_limb_t divisor, mp_limb_t inverse,
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/
divrem_1.asm
34 C cycles/limb
35 C 486 approx 43 maybe
36 C P5 44
37 C P6 39
38 C P6MMX 39
39 C K6 22
40 C K7 42
41 C P4 58
44 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
45 C mp_srcptr src, mp_size_t size, mp_limb_t divisor);
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/pentium/mmx/
mul_1.asm
34 C cycles/limb
35 C P5: 12.0 for 32-bit multiplier
36 C 7.0 for 16-bit multiplier
39 C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size,
40 C mp_limb_t multiplier);
41 C
42 C When the multiplier is 16 bits some special case MMX code is used. Small
43 C multipliers might arise reasonably often from mpz_mul_ui etc. If the size
44 C is odd there's roughly a 5 cycle penalty, so times for say size==7 and
45 C size==8 end up being quite close. If src isn't aligned to an 8 byte
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/alpha/ev5/
diveby3.asm
33 C cycles/limb
34 C EV4: 22
35 C EV5: 11.5
36 C EV6: 6.3 Note that mpn_bdiv_dbm1c is faster
38 C TODO
39 C * Remove the unops, they benefit just ev6, which no longer uses this file.
40 C * Try prefetch for destination, using lds.
41 C * Improve feed-in code, by moving initial mulq earlier; make initial load
42 C to u0/u0 to save some copying.
43 C * Combine u0 and u2, u1 and u3.
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/alpha/
mode1o.asm
34 C cycles/limb
35 C EV4: 47
36 C EV5: 30
37 C EV6: 15
40 C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size, mp_limb_t d,
41 C mp_limb_t c)
42 C
43 C This code follows the "alternate" code in mpn/generic/mode1o.c,
44 C eliminating cbit+climb from the dependent chain. This leaves,
45 C
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/p6/
mul_basecase.asm
34 C P6: approx 6.5 cycles per cross product (16 limbs/loop unrolling).
46 C void mpn_mul_basecase (mp_ptr wp,
47 C mp_srcptr xp, mp_size_t xsize,
48 C mp_srcptr yp, mp_size_t ysize);
49 C
50 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
51 C it's faster because it does most of the mpn_addmul_1() startup
52 C calculations only once.
94 C -----------------------------------------------------------------------------
144 C -----------------------------------------------------------------------------
[all …]
sqr_basecase.asm
34 C P6: approx 4.0 cycles per cross product, or 7.75 cycles per triangular
35 C product (measured on the speed difference between 20 and 40 limbs,
36 C which is the Karatsuba recursing range).
52 C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size);
53 C
54 C The algorithm is basically the same as mpn/generic/sqr_basecase.c, but a
55 C lot of function call overheads are avoided, especially when the given size
56 C is small.
57 C
58 C The code size might look a bit excessive, but not all of it is executed so
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k7/
mul_basecase.asm
34 C K7: approx 4.42 cycles per cross product at around 20x20 limbs (16
35 C limbs/loop unrolling).
52 C void mpn_mul_basecase (mp_ptr wp,
53 C mp_srcptr xp, mp_size_t xsize,
54 C mp_srcptr yp, mp_size_t ysize);
55 C
56 C Calculate xp,xsize multiplied by yp,ysize, storing the result in
57 C wp,xsize+ysize.
58 C
59 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
[all …]
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86_64/k8/
sqr_basecase.asm
35 C The inner loops of this code are the result of running a code generation and
36 C optimization tool suite written by David Harvey and Torbjorn Granlund.
38 C NOTES
39 C * There is a major stupidity in that we call mpn_mul_1 initially, for a
40 C large trip count. Instead, we should follow the generic/sqr_basecase.c
41 C code which uses addmul_2s from the start, conditionally leaving a 1x1
42 C multiply to the end. (In assembly code, one would stop invoking
43 C addmul_2s loops when perhaps 3x2s respectively a 2x2s remains.)
44 C * Another stupidity is in the sqr_diag_addlsh1 code. It does not need to
45 C save/restore carry, instead it can propagate into the high product word.
[all …]