/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k7/mmx/
divrem_1.asm
  35 C K7: 17.0 cycles/limb integer part, 15.0 cycles/limb fraction part.
  38 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
  39 C                         mp_srcptr src, mp_size_t size,
  40 C                         mp_limb_t divisor);
  41 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
  42 C                          mp_srcptr src, mp_size_t size,
  43 C                          mp_limb_t divisor, mp_limb_t carry);
  44 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
  45 C                                mp_srcptr src, mp_size_t size,
  46 C                                mp_limb_t divisor, mp_limb_t inverse,
  [all …]

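The `mpn_divrem_1` prototypes above separate an "integer part" from a "fraction part". A minimal Python sketch of that split follows, assuming the usual GMP convention that the `size` integer quotient limbs land at `dst + xsize` and `xsize` extra fraction limbs land at `dst`, with the remainder returned. The 32-bit limb width, the list-based interface, the helper `limbs_to_int`, and the treatment of `carry` as an initial remainder are illustrative assumptions, not taken from the listing.

```python
LIMB_BITS = 32          # assumed limb width for the x86 files in this listing
B = 1 << LIMB_BITS


def limbs_to_int(limbs):
    """Interpret a least-significant-first limb list as one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def divrem_1(src, xsize, divisor, carry=0):
    """Divide src (limbs, least significant first) by a single limb.

    Returns (dst, remainder); dst holds xsize fraction limbs followed by
    len(src) integer limbs. carry plays the role of the extra argument of
    mpn_divrem_1c: an initial remainder, assumed to be below divisor.
    """
    assert 0 < divisor < B and 0 <= carry < divisor
    dst = [0] * (xsize + len(src))
    r = carry
    # Integer part: walk the source limbs from most to least significant.
    for i in range(len(src) - 1, -1, -1):
        q, r = divmod(r * B + src[i], divisor)
        dst[xsize + i] = q
    # Fraction part: keep dividing while shifting in zero limbs.
    for i in range(xsize - 1, -1, -1):
        q, r = divmod(r * B, divisor)
        dst[i] = q
    return dst, r
```

The two loops correspond to the two cycle counts quoted per file: the fraction loop has no source limb to load, which is consistent with it being the cheaper of the two.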
/netbsd-src/external/lgpl3/gmp/dist/mpn/pa64/
submul_1.asm
  34 C cycles/limb
  35 C 8000,8200: 7
  36 C 8500,8600,8700: 6.5
  38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
  39 C could be saved there per call.
  41 C DESCRIPTION:
  42 C The main loop "BIG" is 4-way unrolled, mainly to allow
  43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
  44 C registers to the IU registers, have demanded a deep software pipeline, and
  45 C a lot of stack slots for partial products in flight.
  [all …]

addmul_1.asm
  34 C cycles/limb
  35 C 8000,8200: 7
  36 C 8500,8600,8700: 6.375
  38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
  39 C could be saved there per call.
  41 C DESCRIPTION:
  42 C The main loop "BIG" is 4-way unrolled, mainly to allow
  43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
  44 C registers to the IU registers, have demanded a deep software pipeline, and
  45 C a lot of stack slots for partial products in flight.
  [all …]

mul_1.asm
  34 C cycles/limb
  35 C 8000,8200: 6.5
  36 C 8500,8600,8700: 5.625
  38 C The feed-in and wind-down code has not yet been scheduled. Many cycles
  39 C could be saved there per call.
  41 C DESCRIPTION:
  42 C The main loop "BIG" is 4-way unrolled, mainly to allow
  43 C effective use of ADD,DC. Delays in moving data via the cache from the FP
  44 C registers to the IU registers, have demanded a deep software pipeline, and
  45 C a lot of stack slots for partial products in flight.
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/ia64/
mode1o.asm
  36 C cycles/limb
  37 C Itanium: 15
  38 C Itanium 2: 8
  50 C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size,
  51 C                                mp_limb_t divisor, mp_limb_t carry);
  52 C
  53 C The modexact algorithm is usually conceived as a dependent chain
  54 C
  55 C l = src[i] - c
  56 C q = low(l * inverse)
  [all …]

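The mode1o.asm hit above shows only the first two steps of the dependent chain (`l = src[i] - c`, `q = low(l * inverse)`) before the listing cuts off. The sketch below completes the chain in the way a standard Hensel (low-to-high) exact division would; the last two steps, the names `h` and `borrow`, the 64-bit limb width, and the list-based interface are assumptions, not text from the file.

```python
LIMB_BITS = 64          # assumed limb width
B = 1 << LIMB_BITS


def modexact_1c_odd(src, divisor, carry=0):
    """Low-to-high exact-remainder chain for an odd single-limb divisor.

    src is a least-significant-first limb list. The returned value r
    satisfies (a - carry + r * B**len(src)) % divisor == 0, where a is
    the integer represented by src.
    """
    assert divisor % 2 == 1 and 0 < divisor < B
    assert 0 <= carry <= divisor
    inverse = pow(divisor, -1, B)       # divisor * inverse == 1 (mod B)
    c = carry
    for s in src:
        borrow = 1 if s < c else 0
        l = (s - c) % B                 # l = src[i] - c, wrapped to one limb
        q = (l * inverse) % B           # q = low(l * inverse)
        h = (q * divisor) >> LIMB_BITS  # assumed step: high limb of q * divisor
        c = h + borrow                  # assumed step: carry into the next limb
    return c
```

Every iteration needs the previous `c` before it can start, which is why the comment calls the chain dependent: the per-limb cost is bounded below by the latency of a subtract and two multiplies in sequence.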
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k6/
aorsmul_1.asm
  34 C                            cycles/limb
  35 C P5
  36 C P6 model 0-8,10-12         5.94
  37 C P6 model 9 (Banias)        5.51
  38 C P6 model 13 (Dothan)       5.57
  39 C P4 model 0 (Willamette)
  40 C P4 model 1 (?)
  41 C P4 model 2 (Northwood)
  42 C P4 model 3 (Prescott)
  43 C P4 model 4 (Nocona)
  [all …]

mul_basecase.asm
  34 C K6: approx 9.0 cycles per cross product on 30x30 limbs (with 16 limbs/loop
  35 C unrolling).
  51 C void mpn_mul_basecase (mp_ptr wp,
  52 C                        mp_srcptr xp, mp_size_t xsize,
  53 C                        mp_srcptr yp, mp_size_t ysize);
  54 C
  55 C Calculate xp,xsize multiplied by yp,ysize, storing the result in
  56 C wp,xsize+ysize.
  57 C
  58 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
  [all …]

sqr_basecase.asm
  34 C K6: approx 4.7 cycles per cross product, or 9.2 cycles per triangular
  35 C product (measured on the speed difference between 17 and 33 limbs,
  36 C which is roughly the Karatsuba recursing range).
  70 C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size);
  71 C
  72 C The algorithm is essentially the same as mpn/generic/sqr_basecase.c, but a
  73 C lot of function call overheads are avoided, especially when the given size
  74 C is small.
  75 C
  76 C The code size might look a bit excessive, but not all of it is executed
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/sparc64/ultrasparc1234/
addmul_2.asm
  34 C cycles/limb
  35 C UltraSPARC 1&2: 9
  36 C UltraSPARC 3: 10
  38 C Algorithm: We use 16 floating-point multiplies per limb product, with the
  39 C 2-limb v operand split into eight 16-bit pieces, and the n-limb u operand
  40 C split into 32-bit pieces. We sum four 48-bit partial products using
  41 C floating-point add, then convert the resulting four 50-bit quantities and
  42 C transfer them to the integer unit.
  44 C Possible optimizations:
  45 C 1. Align the stack area where we transfer the four 50-bit product-sums
  [all …]

addmul_1.asm
  34 C cycles/limb
  35 C UltraSPARC 1&2: 14
  36 C UltraSPARC 3: 17.5
  38 C Algorithm: We use eight floating-point multiplies per limb product, with the
  39 C invariant v operand split into four 16-bit pieces, and the up operand split
  40 C into 32-bit pieces. We sum pairs of 48-bit partial products using
  41 C floating-point add, then convert the four 49-bit product-sums and transfer
  42 C them to the integer unit.
  44 C Possible optimizations:
  45 C 0. Rewrite to use algorithm of mpn_addmul_2.
  [all …]

mul_1.asm
  34 C cycles/limb
  35 C UltraSPARC 1&2: 14
  36 C UltraSPARC 3: 18.5
  38 C Algorithm: We use eight floating-point multiplies per limb product, with the
  39 C invariant v operand split into four 16-bit pieces, and the s1 operand split
  40 C into 32-bit pieces. We sum pairs of 48-bit partial products using
  41 C floating-point add, then convert the four 49-bit product-sums and transfer
  42 C them to the integer unit.
  44 C Possible optimizations:
  45 C 1. Align the stack area where we transfer the four 49-bit product-sums
  [all …]

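The UltraSPARC `addmul_1` and `mul_1` comments above describe forming a 64x64-bit limb product from eight floating-point multiplies. The sketch below illustrates why that is exact in double precision: a 32-bit piece times a 16-bit piece is below 2^48, and the sum of two such products is below 2^49, both well inside the 53-bit mantissa. The function names and the choice to pair products of equal weight within one limb product are assumptions made for illustration; the listing does not show which products the assembly actually pairs.

```python
def split(value, piece_bits, count):
    """Split value into count pieces of piece_bits bits, low piece first."""
    mask = (1 << piece_bits) - 1
    return [(value >> (piece_bits * k)) & mask for k in range(count)]


def limb_product_via_fp(u, v):
    """Return u * v for 64-bit u, v using only double-precision multiplies."""
    assert 0 <= u < 1 << 64 and 0 <= v < 1 << 64
    u_pieces = split(u, 32, 2)      # two 32-bit pieces of u
    v_pieces = split(v, 16, 4)      # four 16-bit pieces of v
    sums = {}                       # bit weight -> floating-point product-sum
    for i, up in enumerate(u_pieces):
        for k, vp in enumerate(v_pieces):
            product = float(up) * float(vp)     # below 2**48, so exact
            weight = 32 * i + 16 * k
            # At most two products share a weight here, so each sum stays
            # below 2**49 and the floating-point add is exact as well.
            sums[weight] = sums.get(weight, 0.0) + product
    # "Transfer to the integer unit": convert and recombine with shifts.
    return sum(int(s) << weight for weight, s in sums.items())
```

The conversion back to integers at the end is the step the optimization notes in these files keep returning to, since on that hardware it goes through memory.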
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/pentium4/sse2/
divrem_1.asm
  34 C P4: 32 cycles/limb integer part, 30 cycles/limb fraction part.
  37 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
  38 C                         mp_srcptr src, mp_size_t size,
  39 C                         mp_limb_t divisor);
  40 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
  41 C                          mp_srcptr src, mp_size_t size,
  42 C                          mp_limb_t divisor, mp_limb_t carry);
  43 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
  44 C                                mp_srcptr src, mp_size_t size,
  45 C                                mp_limb_t divisor, mp_limb_t inverse,
  [all …]

popcount.asm
  35 C 32-bit popcount hamdist
  36 C cycles/limb cycles/limb
  37 C P5 -
  38 C P6 model 0-8,10-12 -
  39 C P6 model 9 (Banias) ?
  40 C P6 model 13 (Dothan) 4
  41 C P4 model 0 (Willamette) ?
  42 C P4 model 1 (?) ?
  43 C P4 model 2 (Northwood) 3.9
  44 C P4 model 3 (Prescott) ?
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86_64/core2/
sqr_basecase.asm
  36 C cycles/limb   mul_2   addmul_2    sqr_diag_addlsh1
  37 C AMD K8,K9
  38 C AMD K10
  39 C AMD bull
  40 C AMD pile
  41 C AMD steam
  42 C AMD bobcat
  43 C AMD jaguar
  44 C Intel P4
  45 C Intel core    4.9     4.18-4.25   3.87
  [all …]

mul_basecase.asm
  36 C cycles/limb   mul_1   mul_2   mul_3   addmul_2
  37 C AMD K8,K9
  38 C AMD K10
  39 C AMD bull
  40 C AMD pile
  41 C AMD steam
  42 C AMD bobcat
  43 C AMD jaguar
  44 C Intel P4
  45 C Intel core    4.0     4.0     -       4.18-4.25
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/sparc32/v9/
sqr_diagonal.asm
  34 C INPUT PARAMETERS
  35 C rp i0
  36 C up i1
  37 C n i2
  39 C This code uses a very deep software pipeline, due to the need for moving data
  40 C forth and back between the integer registers and floating-point registers.
  41 C
  42 C A VIS variant of this code would make the pipeline less deep, since the
  43 C masking now done in the integer unit could take place in the floating-point
  44 C unit using the FAND instruction. It would be possible to save several cycles
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/p6/mmx/
divrem_1.asm
  34 C P6MMX: 25.0 cycles/limb integer part, 17.5 cycles/limb fraction part.
  37 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
  38 C                         mp_srcptr src, mp_size_t size,
  39 C                         mp_limb_t divisor);
  40 C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize,
  41 C                          mp_srcptr src, mp_size_t size,
  42 C                          mp_limb_t divisor, mp_limb_t carry);
  43 C mp_limb_t mpn_preinv_divrem_1 (mp_ptr dst, mp_size_t xsize,
  44 C                                mp_srcptr src, mp_size_t size,
  45 C                                mp_limb_t divisor, mp_limb_t inverse,
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/
divrem_1.asm
  34 C cycles/limb
  35 C 486     approx 43 maybe
  36 C P5      44
  37 C P6      39
  38 C P6MMX   39
  39 C K6      22
  40 C K7      42
  41 C P4      58
  44 C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize,
  45 C                         mp_srcptr src, mp_size_t size, mp_limb_t divisor);
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/pentium/mmx/
mul_1.asm
  34 C cycles/limb
  35 C P5: 12.0 for 32-bit multiplier
  36 C      7.0 for 16-bit multiplier
  39 C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size,
  40 C                      mp_limb_t multiplier);
  41 C
  42 C When the multiplier is 16 bits some special case MMX code is used. Small
  43 C multipliers might arise reasonably often from mpz_mul_ui etc. If the size
  44 C is odd there's roughly a 5 cycle penalty, so times for say size==7 and
  45 C size==8 end up being quite close. If src isn't aligned to an 8 byte
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/alpha/ev5/
diveby3.asm
  33 C cycles/limb
  34 C EV4: 22
  35 C EV5: 11.5
  36 C EV6: 6.3 Note that mpn_bdiv_dbm1c is faster
  38 C TODO
  39 C * Remove the unops, they benefit just ev6, which no longer uses this file.
  40 C * Try prefetch for destination, using lds.
  41 C * Improve feed-in code, by moving initial mulq earlier; make initial load
  42 C to u0/u0 to save some copying.
  43 C * Combine u0 and u2, u1 and u3.
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/alpha/
mode1o.asm
  34 C cycles/limb
  35 C EV4: 47
  36 C EV5: 30
  37 C EV6: 15
  40 C mp_limb_t mpn_modexact_1c_odd (mp_srcptr src, mp_size_t size, mp_limb_t d,
  41 C                                mp_limb_t c)
  42 C
  43 C This code follows the "alternate" code in mpn/generic/mode1o.c,
  44 C eliminating cbit+climb from the dependent chain. This leaves,
  45 C
  [all …]

/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/p6/
mul_basecase.asm
  34 C P6: approx 6.5 cycles per cross product (16 limbs/loop unrolling).
  46 C void mpn_mul_basecase (mp_ptr wp,
  47 C                        mp_srcptr xp, mp_size_t xsize,
  48 C                        mp_srcptr yp, mp_size_t ysize);
  49 C
  50 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
  51 C it's faster because it does most of the mpn_addmul_1() startup
  52 C calculations only once.
  94 C -----------------------------------------------------------------------------
 144 C -----------------------------------------------------------------------------
  [all …]

sqr_basecase.asm
  34 C P6: approx 4.0 cycles per cross product, or 7.75 cycles per triangular
  35 C product (measured on the speed difference between 20 and 40 limbs,
  36 C which is the Karatsuba recursing range).
  52 C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size);
  53 C
  54 C The algorithm is basically the same as mpn/generic/sqr_basecase.c, but a
  55 C lot of function call overheads are avoided, especially when the given size
  56 C is small.
  57 C
  58 C The code size might look a bit excessive, but not all of it is executed so
  [all …]

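The `sqr_basecase` timings above distinguish "cross products" from "triangular products". One common way to organise a basecase squaring that matches that vocabulary is sketched below: compute each off-diagonal product `src[i] * src[j]` (i < j) once, then double the triangle while adding the diagonal squares. This is an illustrative sketch under that assumption, not a transcription of the assembly or of mpn/generic/sqr_basecase.c; the 32-bit limb width and the helper `limbs_to_int` are also assumptions.

```python
LIMB_BITS = 32          # assumed limb width for the x86 files in this listing
MASK = (1 << LIMB_BITS) - 1


def limbs_to_int(limbs):
    """Interpret a least-significant-first limb list as one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def sqr_basecase(src):
    """Square src (limbs, least significant first) into 2 * len(src) limbs."""
    n = len(src)
    dst = [0] * (2 * n)
    # Triangle: accumulate every product src[i] * src[j] with i < j once.
    for i in range(n - 1):
        carry = 0
        for j in range(i + 1, n):
            t = dst[i + j] + src[i] * src[j] + carry
            dst[i + j] = t & MASK
            carry = t >> LIMB_BITS
        dst[i + n] = carry
    # Double the triangle and add the diagonal squares src[i] ** 2.
    carry = 0
    for i in range(n):
        sq = src[i] * src[i]
        lo = 2 * dst[2 * i] + (sq & MASK) + carry
        dst[2 * i] = lo & MASK
        hi = 2 * dst[2 * i + 1] + (sq >> LIMB_BITS) + (lo >> LIMB_BITS)
        dst[2 * i + 1] = hi & MASK
        carry = hi >> LIMB_BITS
    return dst
```

Since the triangle holds roughly half of the n^2 limb products, a cost per triangular product close to twice the cost per cross product, as in the figures quoted above, is what this organisation would predict.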
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86/k7/
mul_basecase.asm
  34 C K7: approx 4.42 cycles per cross product at around 20x20 limbs (16
  35 C limbs/loop unrolling).
  52 C void mpn_mul_basecase (mp_ptr wp,
  53 C                        mp_srcptr xp, mp_size_t xsize,
  54 C                        mp_srcptr yp, mp_size_t ysize);
  55 C
  56 C Calculate xp,xsize multiplied by yp,ysize, storing the result in
  57 C wp,xsize+ysize.
  58 C
  59 C This routine is essentially the same as mpn/generic/mul_basecase.c, but
  [all …]

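The `mpn_mul_basecase` comment above specifies the result as `xp,xsize` times `yp,ysize` stored in `wp,xsize+ysize`. A minimal schoolbook sketch of that contract follows, built from one `mul_1` row and then one `addmul_1` row per remaining limb of `yp`. The Python list interface, the `offset` parameter, the 32-bit limb width, and the helper `limbs_to_int` are illustrative assumptions; the requirement `xsize >= ysize >= 1` is assumed from GMP's usual convention for this routine.

```python
LIMB_BITS = 32          # assumed limb width for the x86 files in this listing
MASK = (1 << LIMB_BITS) - 1


def limbs_to_int(limbs):
    """Interpret a least-significant-first limb list as one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def mul_1(wp, xp, v):
    """wp[0:len(xp)] = xp * v; return the carry-out limb."""
    carry = 0
    for i, x in enumerate(xp):
        t = x * v + carry
        wp[i] = t & MASK
        carry = t >> LIMB_BITS
    return carry


def addmul_1(wp, offset, xp, v):
    """wp[offset:offset+len(xp)] += xp * v; return the carry-out limb."""
    carry = 0
    for i, x in enumerate(xp):
        t = wp[offset + i] + x * v + carry
        wp[offset + i] = t & MASK
        carry = t >> LIMB_BITS
    return carry


def mul_basecase(xp, yp):
    """Return the xsize + ysize limb product of xp and yp."""
    assert len(xp) >= len(yp) >= 1
    wp = [0] * (len(xp) + len(yp))
    wp[len(xp)] = mul_1(wp, xp, yp[0])
    for j in range(1, len(yp)):
        wp[len(xp) + j] = addmul_1(wp, j, xp, yp[j])
    return wp
```

Each inner-loop iteration is one "cross product" in the sense of the cycle counts quoted in these files. Calling a separate routine per row repeats its setup `ysize` times, which is the overhead the P6 comment says the assembly version pays only once.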
/netbsd-src/external/lgpl3/gmp/dist/mpn/x86_64/k8/
sqr_basecase.asm
  35 C The inner loops of this code are the result of running a code generation and
  36 C optimization tool suite written by David Harvey and Torbjorn Granlund.
  38 C NOTES
  39 C * There is a major stupidity in that we call mpn_mul_1 initially, for a
  40 C large trip count. Instead, we should follow the generic/sqr_basecase.c
  41 C code which uses addmul_2s from the start, conditionally leaving a 1x1
  42 C multiply to the end. (In assembly code, one would stop invoking
  43 C addmul_2s loops when perhaps 3x2s respectively a 2x2s remains.)
  44 C * Another stupidity is in the sqr_diag_addlsh1 code. It does not need to
  45 C save/restore carry, instead it can propagate into the high product word.
  [all …]