# aa295216 | 29-Jan-2025 | Jay Foad <jay.foad@amd.com>
Fix typo "tranpose" (#124929)

# 2e6cc79f | 23-Jan-2025 | Durgadoss R <durgadossr@nvidia.com>
[MLIR][NVVM] Migrate CpAsyncOp to intrinsics (#123789)
Intrinsics are also available for the 'cpSize' variants, so this patch
migrates the Op to lower to the intrinsics in all cases.
* Update the existing tests to check the lowering to intrinsics.
* Add new cp_async_zfill tests to verify the lowering for the 'cpSize' variants.
* Tidy up the CHECK lines in the cp_async() function in nvvmir.mlir (NFC).
PTX spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
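
As a point of reference, here is a minimal sketch of the two forms of the op in NVVM dialect assembly (operand names are placeholders; the exact types and attribute spellings are assumptions, not taken from the patch):
```
// Plain variant: asynchronously copy 16 bytes from global to shared memory.
nvvm.cp.async.shared.global %dst, %src, 16, cache = cg : !llvm.ptr<3>, !llvm.ptr<1>
// 'cpSize' variant: only %cpSize bytes are copied and the remainder of the
// 16-byte window is zero-filled.
nvvm.cp.async.shared.global %dst, %src, 16, cache = cg, %cpSize : !llvm.ptr<3>, !llvm.ptr<1>, i32
```
With this change, both forms lower to the corresponding llvm.nvvm.cp.async.* intrinsics.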

# 6dcb2a09 | 17-Jan-2025 | Durgadoss R <durgadossr@nvidia.com>
[MLIR][NVVM] Add Float to TF32 conversion Op (#123199)
PR #121507 added 'cvt' intrinsics to convert
float to tf32, with the valid set of rounding and
saturation modes. This PR adds an NVVM Dialect Op
for the same.
* Lit tests are added to verify the lowering to intrinsics.
* Negative tests are also added to check the error handling of invalid
combinations.
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
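
A rough sketch of what such a conversion could look like in NVVM dialect assembly; the mnemonic, attribute name, and result handling below are assumptions for illustration rather than the exact spelling introduced by the patch:
```
// Convert an f32 value to tf32 (carried in an i32), using a rounding mode
// selected via an attribute; invalid rounding/saturation combinations are
// rejected by the verifier.
%tf32 = nvvm.cvt.float.to.tf32 %x {rnd = #nvvm.fp_rnd_mode<rna>}
```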

# d03f35f9 | 13-Jan-2025 | xiaoleis-nv <99947620+xiaoleis-nv@users.noreply.github.com>
[MLIR][NVVM] Fix the datatype error for nvvm.mma.sync when the operand is bf16 (#122664)
The PR fixes the datatype error for `nvvm.mma.sync` when the operand is
`bf16`. This operation originally required the A/B type to be `f16x2`
for the `bf16` MMA. However, this violates the NVVM intrinsic definition
[here](https://github.com/xiaoleis-nv/llvm-project/blob/372044ee09d39942925824f8f335aef40bfe92f0/llvm/include/llvm/IR/IntrinsicsNVVM.td#L119),
where the A/B operand type should be `i32`. This is a bug, and there are
no tests in MLIR that cover this datatype.
```
// mma bf16 -> s32 @ m16n8k16/m16n8k8
!eq(gft,"m16n8k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m16n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:a:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:b:bf16") : [llvm_i32_ty],
```
This PR addresses this bug and adds tests to guarantee correctness.
Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>

# 66e41a1a | 10-Jan-2025 | Guray Ozen <guray.ozen@gmail.com>
[MLIR][NVVM] Declare InferIntRangeInterface for RangeableRegisterOp (#122263)

# b4b819ce | 11-Dec-2024 | Durgadoss R <durgadossr@nvidia.com>
[MLIR][NVVM] Add Op for TMA Store with reduction (#118853)
PR #116854 adds intrinsics for TMA Store with reduction. This patch adds an NVVM Dialect Op for the same.
* Lit tests are added to verify the lowering to LLVM intrinsics, as well as the invalid cases.
* The common verifier method is updated to handle im2col modes without offsets. This helps Ops like TMA Store, TMA StoreReduce etc.
* The nvvmir.mlir test file is already large, so this patch adds the tests for this Op in a new file under a separate "nvvm/" directory (mlir/test/Target/LLVMIR/nvvm/tma_store_reduce.mlir).
PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>

# 81055ff0 | 20-Nov-2024 | arthurqiu <arthurq@nvidia.com>
[mlir][nvvm] Add attributes for cluster dimension PTX directives (#116973)
The PTX programming model provides cluster dimension directives, which are
leveraged by the downstream `ptxas` compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/#supported-properties and
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-dimension-directives
This PR introduces the cluster dimension directives to MLIR's NVVM
dialect as listed below:
```
cluster_dim_{x,y,z} -> exact number of CTAs per cluster
cluster_max_blocks -> max number of CTAs per cluster
```

# 1b23ebe0 | 15-Nov-2024 | Durgadoss R <durgadossr@nvidia.com>
[MLIR][NVVM] Add Op for TMA Prefetch (#116232)
PR #115527 adds intrinsics for TMA prefetch.
This patch adds an NVVM Dialect Op for the same.
Lit tests to verify the lowering to LLVM intrinsics as well as
verifier tests (for invalid cases) are added.
PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor
Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
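
Purely as an illustration (the op mnemonic and assembly form below are assumptions patterned after the bulk-tensor copy op that appears later in this log, and are not confirmed by the commit message):
```
// Prefetch the tile addressed by the TMA descriptor and the box coordinates
// into the L2 cache.
nvvm.cp.async.bulk.tensor.prefetch %tmaDescriptor, box[%crd0, %crd1] : !llvm.ptr
```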

# 884221ed | 16-Sep-2024 | JOE1994 <joseph942010@gmail.com>
[mlir] Tidy uses of llvm::raw_string_ostream (NFC)
As specified in the docs, 1) raw_string_ostream is always unbuffered and 2) the underlying buffer may be used directly
(see 65b13610a5226b84889b923bae884ba395ad084d for further reference).
* Don't call raw_string_ostream::flush(), which is essentially a no-op.
* Avoid unneeded calls to raw_string_ostream::str(), to avoid excess indirection.

# ece65660 | 30-Aug-2024 | Pradeep Kumar <pradeepku@nvidia.com>
[MLIR][NVVM] Add support for fence.proxy.{acquire, release} Ops (#106689)

# 948862b2 | 04-Jul-2024 | bangyu shen <94283495+shubaoyu2@users.noreply.github.com>
[mlir][nvvm] Fix the verifier of `wgmma.mma_async` wrt transposed layouts (#97538)
WGMMA expects the layouts for A/B to be row/col; the transposed versions
should be col/row. When checking that other datatypes cannot use transposed
layouts, the verifier should reject col-major for A and row-major for B.

# db791b27 | 02-Jul-2024 | Ramkumar Ramachandra <ramkumar.ramachandra@codasip.com>
mlir/LogicalResult: move into llvm (#97309)
This patch is part of a project to move the Presburger library into LLVM.

# 0cc3fe46 | 28-Jun-2024 | Jeff Niu <jeff@modular.com>
Revert "[mlir][NVVM] Disallow results on kernel functions (#96399)" (#97074)
NVVM IR itself doesn't place any restriction that a function annotated
as `nvvm.kernel` actually has no results, so this
Revert "[mlir][NVVM] Disallow results on kernel functions (#96399)" (#97074)
NVVM IR itself doesn't place any restriction that a function annotated
as `nvvm.kernel` actually has no results, so this is a mismatch between the
NVVM dialect and NVVM IR. The GPU dialect might enforce that kernel
functions have no results, but it doesn't make sense to propagate this
constraint downstream.
Reverts llvm/llvm-project#96399

# 346c4a88 | 23-Jun-2024 | Matthias Springer <me@m-sp.org>
[mlir][NVVM] Disallow results on kernel functions (#96399)
Functions that have the `nvvm.kernel` attribute should have 0 results.

# 35d55f28 | 27-Mar-2024 | Justin Fargnoli <justinfargnoli@gmail.com>
[NFC][mlir] Reorder `declarePromisedInterface()` operands (#86628)
Reorder the template operands of `declarePromisedInterface()` to match
`declarePromisedInterfaces()`.

# 8819f879 | 19-Mar-2024 | Guray Ozen <guray.ozen@gmail.com>
[MLIR][NVVM] Add barrier.arrive (#85412)
The PR adds the `nvvm.barrier.arrive` Op. It is a useful op for producer-consumer
modeling.

# b5d694ba | 14-Feb-2024 | Guray Ozen <guray.ozen@gmail.com>
[mlir][nvvm] Introduce `nvvm.barrier` OP (#81487)
This PR introduces the `nvvm.barrier` OP to the NVVM dialect.
Currently, NVVM only supports `nvvm.barrier0`, which synchronizes
all threads using barrier resource 0.
The new `nvvm.barrier` has two essential arguments: the barrier resource
and the number of threads. This added flexibility allows for selective
synchronization of threads within a CTA, aligning with the capabilities
provided by LLVM intrinsics or the PTX model.
I think we can deprecate `nvvm.barrier0` in favor of the more generic
`nvvm.barrier`.
```
// Equivalent to nvvm.barrier0 (or __syncthreads() in CUDA)
nvvm.barrier
// Synchronize all threads using the 3rd barrier resource.
nvvm.barrier id = 3
// Synchronize %numberOfThreads threads using the 3rd barrier resource.
nvvm.barrier id = 3 number_of_threads = %numberOfThreads
```

# fa6850a9 | 12-Feb-2024 | Rishi Surendran <142182875+rishisurendran@users.noreply.github.com>
[mlir][nvvm] Add support for grid_constant attribute on LLVM function arguments (#78228)
Add support for the attribute nvvm.grid_constant on LLVM function arguments.
The attribute can be attached only to arguments of type llvm.ptr that
have the llvm.byval attribute.
Generate LLVM metadata for functions with nvvm.grid_constant arguments.
The metadata node is a list of integers, where each integer n denotes
that the nth parameter has the grid_constant annotation (numbering from 1).
The generated metadata node will be handled by the NVVM compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#supported-properties
for documentation on the grid_constant property.
This patch also adds convertParameterAttr to
LLVMTranslationDialectInterface to support the translation of
derived dialect attributes on function parameters.
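
A minimal sketch of what an annotated kernel might look like in the LLVM dialect, assuming `nvvm.grid_constant` is a unit attribute placed on a byval pointer argument (the argument and function names are placeholders):
```
// Hypothetical kernel with one byval pointer argument marked grid_constant.
llvm.func @kernel(%arg0: !llvm.ptr {llvm.byval = i32, nvvm.grid_constant}) attributes {nvvm.kernel} {
  llvm.return
}
```
During translation, such a function would receive a grid_constant metadata entry listing the 1-based indices of the annotated parameters (here, parameter 1).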

# 12c241b3 | 22-Jan-2024 | Guray Ozen <guray.ozen@gmail.com>
[MLIR][NVVM] Explicit Data Type for Output in `wgmma.mma_async` (#78713)
The current implementation of the `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of the struct member, which can be
non-intuitive, especially in cases where types like `2xf16` are packed
into `i32`.
This PR addresses this issue by improving the Op to include an explicit
data type for the output matrix.
The modified Op now includes an explicit data type for Matrix-D (<f16>),
and looks as follows:
```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
%descA, %descB, %result,
#nvvm.shape<m = 64, n = 32, k = 16>,
D [<f16>, #nvvm.wgmma_scale_out<zero>],
A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```

# 2aec7083 | 09-Jan-2024 | Guray Ozen <guray.ozen@gmail.com>
[mlir][gpu] Use DenseI32Array for NVVM's maxntid and reqntid (NFC) (#77466)

# 85b23271 | 22-Dec-2023 | Adam Paszke <apaszke@google.com>
[mlir][nvvm] Fix the PTX lowering of wgmma.mma_async (#76150)

# 80ff67be | 04-Dec-2023 | Guray Ozen <guray.ozen@gmail.com>
[mlir][nvvm] Introduce `nvvm.fence.proxy` (#74057)
This PR introduces the `nvvm.fence.proxy` OP for the following cases:
```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.global>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cluster>}
```

# 68433f6b | 29-Nov-2023 | Guray Ozen <guray.ozen@gmail.com>
[mlir][nvvm] Introduce `setmaxregister.sync.aligned` Op (#73780)
This PR introduces the `setmaxregister.sync.aligned` Op to increase or
decrease the per-thread register count.
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg
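
For illustration only, the op might be used roughly as follows; the mnemonic and the increase/decrease spelling are assumptions, and the register counts are arbitrary examples:
```
// Raise the per-thread register ceiling for the warps in this warpgroup ...
nvvm.setmaxregister increase 232
// ... and release the extra registers again afterwards.
nvvm.setmaxregister decrease 40
```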

# 9ceea088 | 22-Nov-2023 | Guray Ozen <guray.ozen@gmail.com>
[mlir] `im2col` & `l2cache` on `cp.async.bulk.tensor.shared.cluster.global` (#72967)
The PR adds support for `im2col` and `l2cache` to
`cp.async.bulk.tensor.shared.cluster.global`. The Op now supports all
the traits of the corresponding PTX instruction.
The current structure of this operation looks somewhat like below. The
PR also simplifies types so we don't need to write obvious types after
`:` anymore.
```
nvvm.cp.async.bulk.tensor.shared.cluster.global
%dest, %tmaDescriptor, %barrier,
box[%crd0,%crd1,%crd2,%crd3,%crd4]
im2col[%off0,%off1,%off2] <-- PR introduces
multicast_mask = %ctamask
l2_cache_hint = %cacheHint <-- PR introduces
: !llvm.ptr<3>, !llvm.ptr
```

# 5316d19e | 19-Oct-2023 | Guray Ozen <guray.ozen@gmail.com>
[mlir][nvvm] Introduce `nvvm.stmatrix` Op (#69467)
This PR adds the `nvvm.stmatrix` Op to the NVVM dialect. The Op collectively
stores one or more matrices across all threads in a warp to the given
address location in shared memory.
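
A hypothetical usage sketch (the operand types and the layout attribute spelling are assumptions, not taken from the commit):
```
// Each thread contributes a 32-bit fragment %frag; collectively the warp
// stores one 8x8 matrix of 16-bit elements to shared memory at %ptr.
nvvm.stmatrix %ptr, %frag {layout = #nvvm.mma_layout<row>} : !llvm.ptr<3>, i32
```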