History log of /llvm-project/mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp (Results 1 – 25 of 107)
Revision Date Author Comments
# aa295216 29-Jan-2025 Jay Foad <jay.foad@amd.com>

Fix typo "tranpose" (#124929)


# 2e6cc79f 23-Jan-2025 Durgadoss R <durgadossr@nvidia.com>

[MLIR][NVVM] Migrate CpAsyncOp to intrinsics (#123789)

Intrinsics are available for the 'cpSize'
variants also. So, this patch migrates the Op
to lower to the intrinsics for all cases.

* Update

[MLIR][NVVM] Migrate CpAsyncOp to intrinsics (#123789)

Intrinsics are available for the 'cpSize'
variants also. So, this patch migrates the Op
to lower to the intrinsics for all cases.

* Update the existing tests to check the lowering to intrinsics.
* Add newer cp_async_zfill tests to verify the lowering for the 'cpSize'
variants.
* Tidy-up CHECK lines in cp_async() function in nvvmir.mlir (NFC)

PTX spec link:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>

show more ...


# 6dcb2a09 17-Jan-2025 Durgadoss R <durgadossr@nvidia.com>

[MLIR][NVVM] Add Float to TF32 conversion Op (#123199)

PR #121507 added 'cvt' intrinsics to convert
float to tf32, with the valid set of rounding and
saturation modes. This PR adds an NVVM Dialect

[MLIR][NVVM] Add Float to TF32 conversion Op (#123199)

PR #121507 added 'cvt' intrinsics to convert
float to tf32, with the valid set of rounding and
saturation modes. This PR adds an NVVM Dialect Op
for the same.
* lit tests are added to verify the lowering to intrinsics.
* Negative tests are also added to check the error-handling of invalid
combinations.

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>

show more ...


# d03f35f9 13-Jan-2025 xiaoleis-nv <99947620+xiaoleis-nv@users.noreply.github.com>

[MLIR][NVVM] Fix the datatype error for nvvm.mma.sync when the operand is bf16 (#122664)

The PR fixes the datatype error for `nvvm.mma.sync` when the operand is
`bf16`. This operation originally re

[MLIR][NVVM] Fix the datatype error for nvvm.mma.sync when the operand is bf16 (#122664)

The PR fixes the datatype error for `nvvm.mma.sync` when the operand is
`bf16`. This operation originally requires the A/B type to be `f16x2`
for the `bf16` MMA. However, it violates the NVVM intrinsic
[[here](https://github.com/xiaoleis-nv/llvm-project/blob/372044ee09d39942925824f8f335aef40bfe92f0/llvm/include/llvm/IR/IntrinsicsNVVM.td#L119)],
where the A/B operand type should be `i32`. This is a bug, and there are
no tests in MLIR that cover this datatype.

```
// mma bf16 -> s32 @ m16n8k16/m16n8k8
!eq(gft,"m16n8k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m16n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:a:bf16") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n8k8:b:bf16") : [llvm_i32_ty],
```

This PR addresses this bug and adds tests to guarantee correctness.

Co-authored-by: Xiaolei Shi <xiaoleis@nvidia.com>

show more ...


# 66e41a1a 10-Jan-2025 Guray Ozen <guray.ozen@gmail.com>

[MLIR][NVVM] Declare InferIntRangeInterface for RangeableRegisterOp (#122263)


# b4b819ce 11-Dec-2024 Durgadoss R <durgadossr@nvidia.com>

[MLIR][NVVM] Add Op for TMA Store with reduction (#118853)

PR #116854 adds intrinsics for TMA Store with reduction.
This patch adds an NVVM Dialect Op for the same.

* Lit tests are added to verify

[MLIR][NVVM] Add Op for TMA Store with reduction (#118853)

PR #116854 adds intrinsics for TMA Store with reduction.
This patch adds an NVVM Dialect Op for the same.

* Lit tests are added to verify the lowering to LLVM intrinsics and
invalid cases.
* The common verifier method is updated to handle im2col modes without
offsets.
This helps Ops like TMA Store, TMA StoreReduce etc.
* The nvvmir.mlir test file is already large. So, this patch adds the
tests for this Op
in a new file under a separate "nvvm/" directory.
[mlir/test/Target/LLVMIR/"nvvm"/tma_store_reduce.mlir]

PTX Spec reference:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>

show more ...


# 81055ff0 20-Nov-2024 arthurqiu <arthurq@nvidia.com>

[mlir][nvvm] Add attributes for cluster dimension PTX directives (#116973)

PTX programming models provides cluster dimension directives, which are
leveraged by the downstream `ptxas` compiler. See

[mlir][nvvm] Add attributes for cluster dimension PTX directives (#116973)

PTX programming models provides cluster dimension directives, which are
leveraged by the downstream `ptxas` compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/#supported-properties and
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cluster-dimension-directives

This PR introduces the cluster dimension directives to MLIR's NVVM
dialect as listed below:
```
cluster_dim_{x,y,z} -> exact number of CTAs per cluster
cluster_max_blocks -> max number of CTAs per cluster
```

show more ...


# 1b23ebe0 15-Nov-2024 Durgadoss R <durgadossr@nvidia.com>

[MLIR][NVVM] Add Op for TMA Prefetch (#116232)

PR #115527 adds intrinsics for TMA prefetch.
This patch adds an NVVM Dialect Op for the same.

Lit tests to verify the lowering to LLVM intrinsics a

[MLIR][NVVM] Add Op for TMA Prefetch (#116232)

PR #115527 adds intrinsics for TMA prefetch.
This patch adds an NVVM Dialect Op for the same.

Lit tests to verify the lowering to LLVM intrinsics as well as
verifier tests (for invalid cases) are added.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-prefetch-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>

show more ...


# 884221ed 16-Sep-2024 JOE1994 <joseph942010@gmail.com>

[mlir] Tidy uses of llvm::raw_stream_ostream (NFC)

As specified in the docs,
1) raw_string_ostream is always unbuffered and
2) the underlying buffer may be used directly

( 65b13610a5226b84889b923ba

[mlir] Tidy uses of llvm::raw_stream_ostream (NFC)

As specified in the docs,
1) raw_string_ostream is always unbuffered and
2) the underlying buffer may be used directly

( 65b13610a5226b84889b923bae884ba395ad084d for further reference )

* Don't call raw_string_ostream::flush(), which is essentially a no-op.
* Avoid unneeded calls to raw_string_ostream::str(), to avoid excess indirection.

show more ...


# ece65660 30-Aug-2024 Pradeep Kumar <pradeepku@nvidia.com>

[MLIR][NVVM] Add support for fence.proxy.{acquire, release} Ops (#106689)


# 948862b2 04-Jul-2024 bangyu shen <94283495+shubaoyu2@users.noreply.github.com>

[mlir][nvvm] Fix the verifier of `wgmma.mma_async` wrt transposed layouts (#97538)

the WGMMA expect layouts for A/B are row/col, the transposed version
should be col/row. when checking other datat

[mlir][nvvm] Fix the verifier of `wgmma.mma_async` wrt transposed layouts (#97538)

the WGMMA expect layouts for A/B are row/col, the transposed version
should be col/row. when checking other datatypes cannot use transposed
layout, it should reject col-major for A and row-major for B

show more ...


# db791b27 02-Jul-2024 Ramkumar Ramachandra <ramkumar.ramachandra@codasip.com>

mlir/LogicalResult: move into llvm (#97309)

This patch is part of a project to move the Presburger library into
LLVM.


# 0cc3fe46 28-Jun-2024 Jeff Niu <jeff@modular.com>

Revert "[mlir][NVVM] Disallow results on kernel functions (#96399)" (#97074)

NVVM IR itself doesn't place any restriction that a function annotated
as `nvvm.kernel` actually has no results, so this

Revert "[mlir][NVVM] Disallow results on kernel functions (#96399)" (#97074)

NVVM IR itself doesn't place any restriction that a function annotated
as `nvvm.kernel` actually has no results, so this is a mismatch at the
NVVMDialect level and NVVMIR. The GPU dialect might enforce that kernel
functions have no results, but it doesn't make sense to propagate this
constraint downstream.

Reverts llvm/llvm-project#96399

show more ...


# 346c4a88 23-Jun-2024 Matthias Springer <me@m-sp.org>

[mlir][NVVM] Disallow results on kernel functions (#96399)

Functions that have the `nvvm.kernel` attribute should have 0 results.


# 35d55f28 27-Mar-2024 Justin Fargnoli <justinfargnoli@gmail.com>

[NFC][mlir] Reorder `declarePromisedInterface()` operands (#86628)

Reorder the template operands of `declarePromisedInterface()` to match
`declarePromisedInterfaces()`.


# 8819f879 19-Mar-2024 Guray Ozen <guray.ozen@gmail.com>

[MLIR][NVVM] Add barrier.arrive (#85412)

PR adds `nvvm.barrier.arrive` Op. It is useful op for producer consumer
modeling.


# b5d694ba 14-Feb-2024 Guray Ozen <guray.ozen@gmail.com>

[mlir][nvvm] Introduce `nvvm.barrier` OP (#81487)

This PR that introduces the `nvvm.barrier` OP to the NVVM dialect.
Currently, NVVM only supports the `nvvm.barrier0`, which synchronizes
all threa

[mlir][nvvm] Introduce `nvvm.barrier` OP (#81487)

This PR that introduces the `nvvm.barrier` OP to the NVVM dialect.
Currently, NVVM only supports the `nvvm.barrier0`, which synchronizes
all threads using barrier resource 0.

The new `nvvm.barrier` has two essential arguments: the barrier resource
and the number of threads. This added flexibility allows for selective
synchronization of threads within a CTA, aligning with the capabilities
provided by LLVM intrinsics or the PTX model.

I think we can deprecate `nvvm.barrier0` in favor of the more generic
`nvvm.barrier`.

```
// Equivalent to nvvm.barrier0 (or __syncthreads() in CUDA)
nvvm.barrier

// Synchronize all threads using the 3rd barrier resource.
nvvm.barrier id = 3

// Synchronize %numberOfThreads threads using the 3rd barrier resource.
nvvm.barrier id = 3 number_of_threads = %numberOfThreads
```

show more ...


# fa6850a9 12-Feb-2024 Rishi Surendran <142182875+rishisurendran@users.noreply.github.com>

[mlir][nvvm]Add support for grid_constant attribute on LLVM function arguments (#78228)

Add support for attribute nvvm.grid_constant on LLVM function arguments.
The attribute can be attached only t

[mlir][nvvm]Add support for grid_constant attribute on LLVM function arguments (#78228)

Add support for attribute nvvm.grid_constant on LLVM function arguments.
The attribute can be attached only to arguments of type llvm.ptr that
have llvm.byval attribute.
Generate LLVM metadata for functions with nvvm.grid_constant arguments.
The metadata node is a list of integers, where each integer n denotes
that the nth parameter has the
grid_constant annotation (numbering from 1). The generated metadata node
will be handled by NVVM compiler. See
https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#supported-properties
for documentation on grid_constant property.

This patch also adds convertParameterAttr to
LLVMTranslationDialectInterface for supporting the translation of
derived dialect attributes on function parameters 

show more ...


# 12c241b3 22-Jan-2024 Guray Ozen <guray.ozen@gmail.com>

[MLIR][NVVM] Explicit Data Type for Output in `wgmma.mma_async` (#78713)

The current implementation of `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of str

[MLIR][NVVM] Explicit Data Type for Output in `wgmma.mma_async` (#78713)

The current implementation of `nvvm.wgmma.mma_async` Op deduces the data
type of the output matrix from the data type of struct member, which can be
non-intuitive, especially in cases where types like `2xf16` are packed
into `i32`.

This PR addresses this issue by improving the Op to include an explicit
data type for the output matrix.

The modified Op now includes an explicit data type for Matrix-D (<f16>),
and looks as follows:

```
%result = llvm.mlir.undef : !llvm.struct<(struct<(i32, i32, ...
nvvm.wgmma.mma_async
%descA, %descB, %result,
#nvvm.shape<m = 64, n = 32, k = 16>,
D [<f16>, #nvvm.wgmma_scale_out<zero>],
A [<f16>, #nvvm.wgmma_scale_in<neg>, <col>],
B [<f16>, #nvvm.wgmma_scale_in<neg>, <col>]
```

show more ...


# 2aec7083 09-Jan-2024 Guray Ozen <guray.ozen@gmail.com>

[mlir][gpu] Use DenseI32Array for NVVM's maxntid and reqntid (NFC) (#77466)


# 85b23271 22-Dec-2023 Adam Paszke <apaszke@google.com>

[mlir][nvvm] Fix the PTX lowering of wgmma.mma_async (#76150)


# 80ff67be 04-Dec-2023 Guray Ozen <guray.ozen@gmail.com>

[mlir][nvvm] Introduce `nvvm.fence.proxy` (#74057)

This PR introduce `nvvm.fence.proxy` OP for the following cases:

```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kin

[mlir][nvvm] Introduce `nvvm.fence.proxy` (#74057)

This PR introduce `nvvm.fence.proxy` OP for the following cases:

```
nvvm.fence.proxy { kind = #nvvm.proxy_kind<alias>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.global>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cta>}
nvvm.fence.proxy { kind = #nvvm.proxy_kind<async.shared>, space = #nvvm.shared_space<cluster>}
```

show more ...


# 68433f6b 29-Nov-2023 Guray Ozen <guray.ozen@gmail.com>

[mlir][nvvm] Introduce `setmaxregister.sync.aligned` Op (#73780)

This PR introduce `setmaxregister.sync.aligned` Op to increase or
decrease the register size.


https://docs.nvidia.com/cuda/para

[mlir][nvvm] Introduce `setmaxregister.sync.aligned` Op (#73780)

This PR introduce `setmaxregister.sync.aligned` Op to increase or
decrease the register size.


https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#miscellaneous-instructions-setmaxnreg

show more ...


# 9ceea088 22-Nov-2023 Guray Ozen <guray.ozen@gmail.com>

[mlir] `im2col` & `l2cache` on cp.async.bulk.tensor.shared.cluster.global` (#72967)

PR adds support of `im2col` and `l2cache` to
`cp.async.bulk.tensor.shared.cluster.global`. The Op is now supports

[mlir] `im2col` & `l2cache` on cp.async.bulk.tensor.shared.cluster.global` (#72967)

PR adds support of `im2col` and `l2cache` to
`cp.async.bulk.tensor.shared.cluster.global`. The Op is now supports all
the traits of the corresponding PTX instruction.

The current structure of this operation looks somewhat like below. The
PR also simplifies types so we don't need to write obvious types after
`:` anymore.
```
nvvm.cp.async.bulk.tensor.shared.cluster.global
%dest, %tmaDescriptor, %barrier,
box[%crd0,%crd1,%crd2,%crd3,%crd4]
im2col[%off0,%off1,%off2] <-- PR introduces
multicast_mask = %ctamask
l2_cache_hint = %cacheHint <-- PR introduces
: !llvm.ptr<3>, !llvm.ptr
```

show more ...


# 5316d19e 19-Oct-2023 Guray Ozen <guray.ozen@gmail.com>

[mlir][nvvm] Introduce `nvvm.stmatrix` Op (#69467)

This PR adds `nvvm.stmatrix` Op to NVVM dialect. The Op collectively
store one or more matrices across all threads in a warp to the given
address

[mlir][nvvm] Introduce `nvvm.stmatrix` Op (#69467)

This PR adds `nvvm.stmatrix` Op to NVVM dialect. The Op collectively
store one or more matrices across all threads in a warp to the given
address location in shared memory.

show more ...


12345