xref: /llvm-project/mlir/docs/Dialects/Vector.md (revision ec1981f4ed86a6f954a5ea0bbfaba1c6cd19d807)
1# 'vector' Dialect
2
3[TOC]
4
5MLIR supports multi-dimensional `vector` types and custom operations on those
6types. A generic, retargetable, higher-order `vector` type (`n-D` with `n > 1`)
7is a structured type, that carries semantic information useful for
8transformations. This document discusses retargetable abstractions that exist in
9MLIR today and operate on ssa-values of type `vector` along with pattern
10rewrites and lowerings that enable targeting specific instructions on concrete
11targets. These abstractions serve to separate concerns between operations on
12`memref` (a.k.a buffers) and operations on `vector` values. This is not a new
13proposal but rather a textual documentation of existing MLIR components along
14with a rationale.
15
16## Positioning in the Codegen Infrastructure
17
18The following diagram, recently presented with the
19[StructuredOps abstractions](https://drive.google.com/corp/drive/u/0/folders/1sRAsgsd8Bvpm_IxREmZf2agsGU2KvrK-),
20captures the current codegen paths implemented in MLIR in the various existing
21lowering paths.
22![](https://user-images.githubusercontent.com/10148468/71177417-f78e4d80-2239-11ea-92ef-700f42ea503f.png)
23
24The following diagram seeks to isolate `vector` dialects from the complexity of
25the codegen paths and focus on the payload-carrying ops that operate on std and
26`vector` types. This diagram is not to be taken as set in stone and
27representative of what exists today but rather illustrates the layering of
28abstractions in MLIR.
29
30![`vector` Abstractions in MLIR](https://user-images.githubusercontent.com/10148468/71176949-e85ad000-2238-11ea-9806-200843bc4943.png)
31
32This  separates concerns related to (a) defining efficient operations on
33`vector` types from (b) program analyses + transformations on `memref`, loops
34and other types of structured ops (be they `HLO`, `LHLO`, `Linalg` or other ).
35Looking a bit forward in time, we can put a stake in the ground and venture that
36the higher level of `vector`-level primitives we build and target from codegen
37(or some user/language level), the simpler our task will be, the more complex
38patterns can be expressed and the better performance will be.
39
40## Components of a Generic Retargetable Vector-Level Dialect
41
42The existing MLIR `vector`-level dialects are related to the following bottom-up
43abstractions:
44
451.  Representation in `LLVMIR` via data structures, instructions and intrinsics.
46    This is referred to as the `LLVM` level.
472.  Set of machine-specific operations and types that are built to translate
48    almost 1-1 with the HW ISA. This is referred to as the Hardware Vector
49    level; a.k.a `HWV`. For instance, we have (a) the `NVVM` dialect (for
50    `CUDA`) with tensor core ops, (b) accelerator-specific dialects (internal),
51    a potential (future) `CPU` dialect to capture `LLVM` intrinsics more closely
52    and other dialects for specific hardware. Ideally this should be
53    auto-generated as much as possible from the `LLVM` level.
543.  Set of virtual, machine-agnostic, operations that are informed by costs at
55    the `HWV`-level. This is referred to as the Virtual Vector level; a.k.a
56    `VV`. This is the level that higher-level abstractions (codegen, automatic
57    vectorization, potential vector language, ...) targets.
58
59The existing generic, retargetable, `vector`-level dialect is related to the
60following top-down rewrites and conversions:
61
621.  MLIR Rewrite Patterns applied by the MLIR `PatternRewrite` infrastructure to
63    progressively lower to implementations that match closer and closer to the
64    `HWV`. Some patterns are "in-dialect" `VV -> VV` and some are conversions
65    `VV -> HWV`.
662.  `Virtual Vector -> Hardware Vector` lowering is specified as a set of MLIR
67    lowering patterns that are specified manually for now.
683.  `Hardware Vector -> LLVM` lowering is a mechanical process that is written
69    manually at the moment and that should be automated, following the `LLVM ->
70    Hardware Vector` ops generation as closely as possible.
71
72## Short Description of the Existing Infrastructure
73
74### LLVM level
75
76On CPU, the `n-D` `vector` type currently lowers to `!llvm<array<vector>>`.
77More concretely,
78* `vector<4x8x128xf32>` lowers to `!llvm<[4 x [ 8 x < 128
79x float >]]>` (fixed-width vector), and
80* `vector<4x8x[128]xf32>` lowers to `!llvm<[4 x [ 8 x < vscale x 128
81x float >]]>` (scalable vector).
82
83There are tradeoffs involved related to how one can access subvectors and how
84one uses `llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`.
85The section on [LLVM Lowering Tradeoffs](#llvm-lowering-tradeoffs) offers a
86deeper dive into the current design choices and tradeoffs.
87
88Note, while LLVM supports arrarys of scalable vectors, these are required to be
89fixed-width arrays of 1-D scalable vectors. This means scalable vectors with a
90non-trailing scalable dimension (e.g. `vector<4x[8]x128xf32`) are not
91convertible to LLVM.
92
93Finally, MLIR takes the same view on scalable Vectors as LLVM (c.f.
94[VectorType](https://llvm.org/docs/LangRef.html#vector-type)):
95> For scalable vectors, the total number of elements is a constant multiple
96> (called vscale) of the specified number of elements; vscale is a positive
97> integer that is unknown at compile time and the same hardware-dependent
98> constant for all scalable vectors at run time.  The size of a specific
99> scalable vector type is thus constant within IR, even if the exact size in
100> bytes cannot be determined until run time.
101
102### Hardware Vector Ops
103
104Hardware Vector Ops are implemented as one dialect per target. For internal
105hardware, we are auto-generating the specific HW dialects. For `GPU`, the `NVVM`
106dialect adds operations such as `mma.sync`, `shfl` and tests. For `CPU` things
107are somewhat in-flight because the abstraction is close to `LLVMIR`. The jury is
108still out on  whether a generic `CPU` dialect is concretely needed, but it seems
109reasonable to have the same levels of abstraction for all targets and perform
110cost-based lowering decisions in MLIR even for `LLVM`. Specialized `CPU`
111dialects that would capture specific features not well captured by LLVM peephole
112optimizations of on different types that core MLIR supports (e.g. Scalable
113Vectors) are welcome future extensions.
114
115### Virtual Vector Ops
116
117Some existing Arith and Vector Dialect on `n-D` `vector` types comprise:
118
119```mlir
120// Produces a vector<3x7x8xf32>
121%a = arith.addf %0, %1 : vector<3x7x8xf32>
122// Produces a vector<3x7x8xf32>
123%b = arith.mulf %0, %1 : vector<3x7x8xf32>
124// Produces a vector<3x7x8xf32>
125%c = vector.splat %1 : vector<3x7x8xf32>
126
127%d = vector.extract %0[1]: vector<7x8xf32> from vector<3x7x8xf32>
128%e = vector.extract %0[1, 5]: vector<8xf32> from vector<3x7x8xf32>
129%f = vector.outerproduct %0, %1: vector<4xf32>, vector<8xf32>      // -> vector<4x8xf32>
130%g = vector.outerproduct %0, %1, %2: vector<4xf32>, vector<8xf32>  // fma when adding %2
131
132// Returns a slice of type vector<2x2x16xf32>
133%h = vector.strided_slice %0
134    {offsets = [2, 2], sizes = [2, 2], strides = [1, 1]}:
135  vector<4x8x16xf32>
136
137%i = vector.transfer_read %A[%0, %1]
138    {permutation_map = (d0, d1) -> (d0)}:
139  memref<7x?xf32>, vector<4xf32>
140
141vector.transfer_write %f1, %A[%i0, %i1, %i2, %i3]
142    {permutation_map = (d0, d1, d2, d3) -> (d3, d1, d0)} :
143  vector<5x4x3xf32>, memref<?x?x?x?xf32>
144```
145
146The list of Vector is currently undergoing evolutions and is best kept track of
147by following the evolution of the
148[VectorOps.td](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/Vector/IR/VectorOps.td)
149ODS file (markdown documentation is automatically generated locally when
150building and populates the
151[Vector doc](https://github.com/llvm/llvm-project/blob/main/mlir/docs/Dialects/Vector.md)).
152Recent extensions are driven by concrete use cases of interest. A notable such
153use case is the `vector.contract` op which applies principles of the
154StructuredOps abstraction to `vector` types.
155
156### Virtual Vector Rewrite Patterns
157
158The following rewrite patterns exist at the `VV->VV` level:
159
1601.  The now retired `MaterializeVector` pass used to legalize ops on a
161    coarse-grained virtual `vector` to a finer-grained virtual `vector` by
162    unrolling. This has been rewritten as a retargetable unroll-and-jam pattern
163    on `vector` ops and `vector` types.
1642.  The lowering of `vector_transfer` ops legalizes `vector` load/store ops to
165    permuted loops over scalar load/stores. This should evolve to loops over
166    `vector` load/stores + `mask` operations as they become available `vector`
167    ops at the `VV` level.
168
169The general direction is to add more Virtual Vector level ops and implement more
170useful `VV -> VV` rewrites as composable patterns that the PatternRewrite
171infrastructure can apply iteratively.
172
173### Virtual Vector to Hardware Vector Lowering
174
175For now, `VV -> HWV` are specified in C++ (see for instance the
176[SplatOpLowering for n-D vectors](https://github.com/tensorflow/mlir/commit/0a0c4867c6a6fcb0a2f17ef26a791c1d551fe33d)
177or the
178[VectorOuterProductOp lowering](https://github.com/tensorflow/mlir/commit/957b1ca9680b4aacabb3a480fbc4ebd2506334b8)).
179
180Simple
181[conversion tests](https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/VectorToLLVM/vector-to-llvm.mlir)
182are available for the `LLVM` target starting from the Virtual Vector Level.
183
184## Rationale
185
186### Hardware as `vector` Machines of Minimum Granularity
187
188Higher-dimensional `vector`s are ubiquitous in modern HPC hardware. One way to
189think about Generic Retargetable `vector`-Level Dialect is that it operates on
190`vector` types that are multiples of a "good" `vector` size so the HW can
191efficiently implement a set of high-level primitives (e.g.
192`vector<8x8x8x16xf32>` when HW `vector` size is say `vector<4x8xf32>`).
193
194Some notable `vector` sizes of interest include:
195
1961.  CPU: `vector<HW_vector_size * k>`, `vector<core_count * k’ x
197    HW_vector_size * k>` and `vector<socket_count x core_count * k’ x
198    HW_vector_size * k>`
1992.  GPU: `vector<warp_size * k>`, `vector<warp_size * k x float4>` and
200    `vector<warp_size * k x 4 x 4 x 4>` for tensor_core sizes,
2013.  Other accelerators: n-D `vector` as first-class citizens in the HW.
202
203Depending on the target, ops on sizes that are not multiples of the HW `vector`
204size may either produce slow code (e.g. by going through `LLVM` legalization) or
205may not legalize at all (e.g. some unsupported accelerator X combination of ops
206and types).
207
208### Transformations Problems Avoided
209
210A `vector<16x32x64xf32>` virtual `vector` is a coarse-grained type that can be
211“unrolled” to HW-specific sizes. The multi-dimensional unrolling factors are
212carried in the IR by the `vector` type. After unrolling, traditional
213instruction-level scheduling can be run.
214
215The following key transformations (along with the supporting analyses and
216structural constraints) are completely avoided by operating on a `vector`
217`ssa-value` abstraction:
218
2191.  Loop unroll and unroll-and-jam.
2202.  Loop and load-store restructuring for register reuse.
2213.  Load to store forwarding and Mem2reg.
2224.  Coarsening (raising) from finer-grained `vector` form.
223
224Note that “unrolling” in the context of `vector`s corresponds to partial loop
225unroll-and-jam and not full unrolling. As a consequence this is expected to
226compose with SW pipelining where applicable and does not result in ICache blow
227up.
228
229### The Big Out-Of-Scope Piece: Automatic Vectorization
230
231One important piece not discussed here is automatic vectorization (automatically
232raising from scalar to n-D `vector` ops and types). The TL;DR is that when the
233first "super-vectorization" prototype was implemented, MLIR was nowhere near as
234mature as it is today. As we continue building more abstractions in `VV -> HWV`,
235there is an opportunity to revisit vectorization in MLIR.
236
237Since this topic touches on codegen abstractions, it is technically out of the
238scope of this survey document but there is a lot to discuss in light of
239structured op type representations and how a vectorization transformation can be
240reused across dialects. In particular, MLIR allows the definition of dialects at
241arbitrary levels of granularity and lends itself favorably to progressive
242lowering. The argument can be made that automatic vectorization on a loops + ops
243abstraction is akin to raising structural information that has been lost.
244Instead, it is possible to revisit vectorization as simple pattern rewrites,
245provided the IR is in a suitable form. For instance, vectorizing a
246`linalg.generic` op whose semantics match a `matmul` can be done
247[quite easily with a pattern](https://github.com/tensorflow/mlir/commit/bff722d6b59ab99b998f0c2b9fccd0267d9f93b5).
248In fact this pattern is trivial to generalize to any type of contraction when
249targeting the `vector.contract` op, as well as to any field (`+/*`, `min/+`,
250`max/+`, `or/and`, `logsumexp/+` ...) . In other words, by operating on a higher
251level of generic abstractions than affine loops, non-trivial transformations
252become significantly simpler and composable at a finer granularity.
253
254Irrespective of the existence of an auto-vectorizer, one can build a notional
255vector language based on the VectorOps dialect and build end-to-end models with
256expressing `vector`s in the IR directly and simple pattern-rewrites.
257[EDSC](https://github.com/llvm/llvm-project/blob/main/mlir/docs/EDSC.md)s
258provide a simple way of driving such a notional language directly in C++.
259
260## Bikeshed Naming Discussion
261
262There are arguments against naming an n-D level of abstraction `vector` because
263most people associate it with 1-D `vector`s. On the other hand, `vector`s are
264first-class n-D values in MLIR. The alternative name Tile has been proposed,
265which conveys higher-D meaning. But it also is one of the most overloaded terms
266in compilers and hardware. For now, we generally use the `n-D` `vector` name and
267are open to better suggestions.
268
269## 0D Vectors
270
271Vectors of dimension 0 (or _0-D vectors_ or _0D vectors_) are allowed inside
272MLIR. For instance, a `f32` vector containing one scalar can be denoted as
273`vector<f32>`. This is similar to the `tensor<f32>` type that is available in
274TensorFlow or the `memref<f32>` type that is available in MLIR.
275
276Generally, a 0D `vector` can be interpreted as a scalar. The benefit of 0D
277`vector`s, `tensor`s, and `memref`s is that they make it easier to lower code
278from various frontends such as TensorFlow and make it easier to handle corner
279cases such as unrolling a loop from 1D to 0D.
280
281## LLVM Lowering Tradeoffs
282
283This section describes the tradeoffs involved in lowering the MLIR n-D vector
284type and operations on it to LLVM-IR. Putting aside the
285[LLVM Matrix](http://lists.llvm.org/pipermail/llvm-dev/2018-October/126871.html)
286proposal for now, this assumes LLVM only has built-in support for 1-D vector.
287The relationship with the LLVM Matrix proposal is discussed at the end of this
288document.
289
290LLVM instructions are prefixed by the `llvm.` dialect prefix (e.g.
291`llvm.insertvalue`). Such ops operate exclusively on 1-D vectors and aggregates
292following the [LLVM LangRef](https://llvm.org/docs/LangRef.html). MLIR
293operations are prefixed by the `vector.` dialect prefix (e.g.
294`vector.insertelement`). Such ops operate exclusively on MLIR `n-D` `vector`
295types.
296
297### Alternatives For Lowering an n-D Vector Type to LLVM
298
299Consider a vector of rank n with static sizes `{s_0, ... s_{n-1}}` (i.e. an MLIR
300`vector<s_0x...s_{n-1}xf32>`). Lowering such an `n-D` MLIR vector type to an
301LLVM descriptor can be done by either:
302
3031.  Nested aggregate type of `1-D` vector:
304    `!llvm."[s_0x[s_1x[...<s_{n-1}xf32>]]]">` in the MLIR LLVM dialect (current
305    lowering in MLIR).
3062.  Flattening to a `1-D` vector: `!llvm<"(s_0*...*s_{n-1})xfloat">` in the MLIR
307    LLVM dialect.
3083.  A mix of both.
309
310There are multiple tradeoffs involved in choosing one or the other that we
311discuss. It is important to note that “a mix of both” immediately reduces to
312“nested aggregate type of 1-D vector” with a `vector.cast %0:
313vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens the most
314"k" minor dimensions.
315
316### Constraints Inherited from LLVM (see LangRef)
317
318The first constraint was already mentioned: LLVM only supports `1-D` `vector`
319types natively. Additional constraints are related to the difference in LLVM
320between vector and
321[aggregate types](https://llvm.org/docs/LangRef.html#aggregate-types):
322> Aggregate Types are a subset of derived types that can contain multiple
323> member types. Arrays and structs are aggregate types. Vectors are not
324> considered to be aggregate types.
325
326This distinction is also reflected in some of the operations. For `1-D` vectors,
327the operations `llvm.extractelement`, `llvm.insertelement`, and
328`llvm.shufflevector` apply, with direct support for dynamic indices. For `n-D`
329vectors with `n>1`, and thus aggregate types at LLVM level, the more restrictive
330operations `llvm.extractvalue` and `llvm.insertvalue` apply, which only accept
331static indices. There is no direct shuffling support for aggregate types.
332
333The next sentence (cf. LangRef [structure
334type](https://llvm.org/docs/LangRef.html#structure-type)) illustrates a
335recurrent tradeoff, also found in MLIR, between
336“value types” (subject to SSA use-def chains) and “memory types” (subject to
337aliasing and side-effects):
338> Structures in memory are accessed using ‘load’ and ‘store’ by getting a
339> pointer to a field with the llvm.getelementptr instruction. Structures in
340> registers are accessed using the llvm.extractvalue and llvm.insertvalue
341> instructions.
342
343When transposing this to MLIR, `llvm.getelementptr` works on pointers to `n-D`
344vectors in memory. For `n-D`, vectors values that live in registers we can use
345`vector.extract` and `vector.insert` which do not accept dynamic indices. Note
346that this is consistent with hardware considerations as discussed below.
347
348An alternative is to use an LLVM `1-D` `vector` type for which one can use
349`llvm.extractelement`, `llvm.insertelement` and `llvm.shufflevector`. These
350operations accept dynamic indices. The implication is that one has to use a
351flattened lowering of an MLIR n-D vector to an LLVM 1-D vector.
352
353There are multiple tradeoffs involved that mix implications on the programming
354model, execution on actual HW and what is visible or hidden from codegen. They
355are discussed in the following sections.
356
357### Nested Aggregate
358
359Pros:
360
3611.  Natural encoding n-D vector -> (n-1)-D aggregate over 1-D vector.
3622.  No need for linearization / delinearization logic inserted everywhere.
3633.  `llvm.insertvalue`, `llvm.extractvalue` of `(n-k)-D` aggregate is natural.
3644.  `llvm.insertelement`, `llvm.extractelement`, `llvm.shufflevector` over `1-D`
365    vector type is natural.
366
367Cons:
368
3691.  `llvm.insertvalue` / `llvm.extractvalue` does not accept dynamic indices but
370    only static ones.
3712.  Dynamic indexing on the non-most-minor dimension requires roundtrips to
372    memory.
3733.  Special intrinsics and native instructions in LLVM operate on `1-D` vectors.
374    This is not expected to be a practical limitation thanks to a `vector.cast
375    %0: vector<4x8x16x32xf32> to vector<4x4096xf32>` operation, that flattens
376    the most minor dimensions (see the bigger picture in implications on
377    codegen).
378
379### Flattened 1-D Vector Type
380
381Pros:
382
3831.  `insertelement` / `extractelement` / `shufflevector` with dynamic indexing
384    is possible over the whole lowered `n-D` vector type.
3852.  Supports special intrinsics and native operations.
386
387Cons:
388
3891.  Requires linearization/delinearization logic everywhere, translations are
390    complex.
3912.  Hides away the real HW structure behind dynamic indexing: at the end of the
392    day, HW vector sizes are generally fixed and multiple vectors will be needed
393    to hold a vector that is larger than the HW.
3943.  Unlikely peephole optimizations will result in good code: arbitrary dynamic
395    accesses, especially at HW vector boundaries unlikely to result in regular
396    patterns.
397
398### Discussion
399
400#### HW Vectors and Implications on the SW and the Programming Model
401
402As of today, the LLVM model only support `1-D` vector types. This is
403unsurprising because historically, the vast majority of HW only supports `1-D`
404vector registers. We note that multiple HW vendors are in the process of
405evolving to higher-dimensional physical vectors.
406
407In the following discussion, let's assume the HW vector size is `1-D` and the SW
408vector size is `n-D`, with `n >= 1`. The same discussion would apply with `2-D`
409HW `vector` size and `n >= 2`. In this context, most HW exhibit a vector
410register file. The number of such vectors is fixed. Depending on the rank and
411sizes of the SW vector abstraction and the HW vector sizes and number of
412registers, an `n-D` SW vector type may be materialized by a mix of multiple
413`1-D` HW vector registers + memory locations at a given point in time.
414
415The implication of the physical HW constraints on the programming model are that
416one cannot index dynamically across hardware registers: a register file can
417generally not be indexed dynamically. This is because the register number is
418fixed and one either needs to unroll explicitly to obtain fixed register numbers
419or go through memory. This is a constraint familiar to CUDA programmers: when
420declaring a `private float a[4]`; and subsequently indexing with a *dynamic*
421value results in so-called **local memory** usage (i.e. roundtripping to
422memory).
423
424#### Implication on codegen
425
426MLIR `n-D` vector types are currently represented as `(n-1)-D` arrays of `1-D`
427vectors when lowered to LLVM. This introduces the consequences on static vs
428dynamic indexing discussed previously: `extractelement`, `insertelement` and
429`shufflevector` on `n-D` vectors in MLIR only support static indices. Dynamic
430indices are only supported on the most minor `1-D` vector but not the outer
431`(n-1)-D`. For other cases, explicit load / stores are required.
432
433The implications on codegen are as follows:
434
4351.  Loops around `vector` values are indirect addressing of vector values, they
436    must operate on explicit load / store operations over `n-D` vector types.
4372.  Once an `n-D` `vector` type is loaded into an SSA value (that may or may not
438    live in `n` registers, with or without spilling, when eventually lowered),
439    it may be unrolled to smaller `k-D` `vector` types and operations that
440    correspond to the HW. This level of MLIR codegen is related to register
441    allocation and spilling that occur much later in the LLVM pipeline.
4423.  HW may support >1-D vectors with intrinsics for indirect addressing within
443    these vectors. These can be targeted thanks to explicit `vector_cast`
444    operations from MLIR `k-D` vector types and operations to LLVM `1-D`
445    vectors + intrinsics.
446
447Alternatively, we argue that directly lowering to a linearized abstraction hides
448away the codegen complexities related to memory accesses by giving a false
449impression of magical dynamic indexing across registers. Instead we prefer to
450make those very explicit in MLIR and allow codegen to explore tradeoffs.
451Different HW will require different tradeoffs in the sizes involved in steps 1.,
4522. and 3.
453
454Decisions made at the MLIR level will have implications at a much later stage in
455LLVM (after register allocation). We do not envision to expose concerns related
456to modeling of register allocation and spilling to MLIR explicitly. Instead,
457each target will expose a set of "good" target operations and `n-D` vector
458types, associated with costs that `PatterRewriters` at the MLIR level will be
459able to target. Such costs at the MLIR level will be abstract and used for
460ranking, not for accurate performance modeling. In the future such costs will be
461learned.
462
463#### Implication on Lowering to Accelerators
464
465To target accelerators that support higher dimensional vectors natively, we can
466start from either `1-D` or `n-D` vectors in MLIR and use `vector.cast` to
467flatten the most minor dimensions to `1-D` `vector<Kxf32>` where `K` is an
468appropriate constant. Then, the existing lowering to LLVM-IR immediately
469applies, with extensions for accelerator-specific intrinsics.
470
471It is the role of an Accelerator-specific vector dialect (see codegen flow in
472the figure above) to lower the `vector.cast`. Accelerator -> LLVM lowering would
473then consist of a bunch of `Accelerator -> Accelerator` rewrites to perform the
474casts composed with `Accelerator -> LLVM` conversions + intrinsics that operate
475on `1-D` `vector<Kxf32>`.
476
477Some of those rewrites may need extra handling, especially if a reduction is
478involved. For example, `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>`
479when `K != K1 * … * Kn` and some arbitrary irregular `vector.cast %0:
480vector<4x4x17xf32> to vector<Kxf32>` may introduce masking and intra-vector
481shuffling that may not be worthwhile or even feasible, i.e. infinite cost.
482
483However `vector.cast %0: vector<K1x...xKnxf32> to vector<Kxf32>` when `K = K1 *
484… * Kn` should be close to a noop.
485
486As we start building accelerator-specific abstractions, we hope to achieve
487retargetable codegen: the same infra is used for CPU, GPU and accelerators with
488extra MLIR patterns and costs.
489
490#### Implication on calling external functions that operate on vectors
491
492It is possible (likely) that we additionally need to linearize when calling an
493external function.
494
495### Relationship to LLVM matrix type proposal.
496
497The LLVM matrix proposal was formulated 1 year ago but seemed to be somewhat
498stalled until recently. In its current form, it is limited to 2-D matrix types
499and operations are implemented with LLVM intrinsics. In contrast, MLIR sits at a
500higher level of abstraction and allows the lowering of generic operations on
501generic n-D vector types from MLIR to aggregates of 1-D LLVM vectors. In the
502future, it could make sense to lower to the LLVM matrix abstraction also for CPU
503even though MLIR will continue needing higher level abstractions.
504
505On the other hand, one should note that as MLIR is moving to LLVM, this document
506could become the unifying abstraction that people should target for 1-D vectors
507and the LLVM matrix proposal can be viewed as a subset of this work.
508
509### Conclusion
510
511The flattened 1-D vector design in the LLVM matrix proposal is good in a
512HW-specific world with special intrinsics. This is a good abstraction for
513register allocation, Instruction-Level-Parallelism and
514SoftWare-Pipelining/Modulo Scheduling optimizations at the register level.
515However MLIR codegen operates at a higher level of abstraction where we want to
516target operations on coarser-grained vectors than the HW size and on which
517unroll-and-jam is applied and patterns across multiple HW vectors can be
518matched.
519
520This makes “nested aggregate type of 1-D vector” an appealing abstraction for
521lowering from MLIR because:
522
5231.  it does not hide complexity related to the buffer vs value semantics and the
524    memory subsystem and
5252.  it does not rely on LLVM to magically make all the things work from a too
526    low-level abstraction.
527
528The use of special intrinsics in a `1-D` LLVM world is still available thanks to
529an explicit `vector.cast` op.
530
531## Operations
532
533[include "Dialects/VectorOps.md"]
534