# Tensor Operator Set Architecture (TOSA) Dialect

[TOC]

## Rationale

The MLIR TOSA dialect implements the [TOSA
specification](https://developer.mlplatform.org/w/tosa/). This document
describes the decision process behind how TOSA expresses operators in
high-level dialects.

TOSA was developed after parallel efforts to rationalize the top-down picture
from multiple high-level frameworks, as well as a bottom-up view of different
hardware target concerns (CPU, GPU and NPU), and reflects a set of choices
that attempt to manage both sets of requirements.

## TOSA and Tensor Level Expressiveness

TOSA endeavors to provide an operator set that fulfils the following
expressiveness goals at the *tensor level of abstraction*:

### Complete

This goal is driven by the top-down perspective: the need to express as much
of multiple high-level frameworks in TOSA as possible. The operator set was
originally derived from an operator frequency analysis of dozens of high-level
networks in different frameworks, selecting the most frequently occurring
operators to establish a common set of tensor-level operators that could
express them.

TOSA categorizes its operator set into classes and attempts to address major
functional operations at the tensor level, including compute, reduction,
elementwise transformations, comparison and control flow.

### Minimal

This goal takes the bottom-up approach: keep the TOSA operator set minimal in
order to bound the design of hardware, operator kernels, code generation
strategies and associated considerations that affect the executability of TOSA
content.

In this regard TOSA seeks to avoid creating compound operators, instead
leaving it to the compiler backend to fuse multiple TOSA ops if required. This
choice also benefits the numerical precision goal, since it is easier to fuse
the numerical functionality of successive operators than to split the
numerical functionality of a compound operator.

### Numerical Precision

TOSA began as a means to address operator-level numerical precision for
code generation and hardware development. It therefore incorporates precision
detail into the operator set.

In this regard, TOSA operators are best understood as a combination of the visible
quantization information embedded within an operation, together with the
functional information about how that information is used, as described in the
specification of the operation.

## TOSA Operator Rationale

The general basis for selecting the operator set that constitutes TOSA is
described in the TOSA specification document under Section 1.3 Operator
Selection. The reasoning behind some operators is explained here:

### COND\_IF and WHILE\_LOOP

Several neural networks express conditional control flow at the tensor level.
A survey of multiple high level frameworks indicated that conditional if and
a loop construct are common in all major frameworks, with some variation.
Since TOSA endeavors to be complete in expressing tensor level functionality
including control flow, it implements these constructs.

The COND\_IF and WHILE\_LOOP operators implement such structured control
flow forms and should be lowerable to corresponding ops in the scf dialect.
Since the dialect seeks to remain isomorphic with an external, serialized form,
the decision was to keep these ops in the dialect (as opposed to deferring
completely to scf). This decision may be re-evaluated if it does not yield
the expected value.
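
The structured control flow forms described above can be illustrated with a
minimal Python sketch. It only models the general semantics of a two-branch
conditional and a loop with loop-carried values; the function names
`cond_if` and `while_loop` and their signatures are assumptions made for
illustration and are not the dialect's builders or the specification's
pseudocode.

```python
from typing import Callable, Tuple


def cond_if(condition: bool,
            then_region: Callable[..., Tuple],
            else_region: Callable[..., Tuple],
            inputs: Tuple) -> Tuple:
    """Run exactly one of two regions on the same inputs (hypothetical helper)."""
    return then_region(*inputs) if condition else else_region(*inputs)


def while_loop(cond_region: Callable[..., bool],
               body_region: Callable[..., Tuple],
               inputs: Tuple) -> Tuple:
    """Repeat body_region while cond_region holds (hypothetical helper).

    The loop-carried values are passed to both regions on every iteration,
    and the body returns the values for the next iteration.
    """
    values = tuple(inputs)
    while cond_region(*values):
        values = tuple(body_region(*values))
    return values


# Usage: carry (i, acc) through the loop, summing 0..4.
loop_result = while_loop(lambda i, acc: i < 5,
                         lambda i, acc: (i + 1, acc + i),
                         (0, 0))

# Usage: pick a branch based on a scalar condition.
branch_result = cond_if(False,
                        lambda x: (x + 1,),
                        lambda x: (x - 1,),
                        (3,))
```

Both helpers take regions as callables over the same set of values, which
mirrors why such ops are expected to map onto structured constructs rather
than arbitrary branches.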

## Using TOSA In A Compiler

The TOSA specification describes each operator in functional detail. It is
expected that compilers that use TOSA will use its builders to construct the
operators so that the quantization information for the operator is correctly
generated.

The functional steps described in the pseudocode of the specification enable
the construction of code generation for that operation, or decisions on the
design of underlying hardware. The functional pseudocode also describes
how the quantization parameters are utilized within the operation.

### Quantization Parameters in Ops vs Tensors

TOSA uses the quantization parameters embedded in the input and output
tensors to construct the quantization attributes that sit within the operator.
Once these attributes are constructed, the quantization information within
the tensors is no longer necessary for code generation.
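
As a rough illustration of this step, the sketch below copies zero points
from tensor-carried quantization parameters into an op-carried attribute.
All class and function names here (`QuantParams`, `TensorType`,
`ConvQuantInfo`, `build_conv_quant_info`) are hypothetical stand-ins chosen
for this example; they are not the dialect's actual types or builders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QuantParams:
    """Hypothetical stand-in for quantization parameters carried by a tensor."""
    scale: float
    zero_point: int


@dataclass(frozen=True)
class TensorType:
    """Hypothetical tensor type: a shape plus optional quantization parameters."""
    shape: Tuple[int, ...]
    quant: Optional[QuantParams] = None


@dataclass(frozen=True)
class ConvQuantInfo:
    """Hypothetical op-carried attribute holding the zero points the op needs."""
    input_zp: int
    weight_zp: int


def build_conv_quant_info(input_type: TensorType,
                          weight_type: TensorType) -> Optional[ConvQuantInfo]:
    """Derive the op attribute from the tensors' quantization parameters.

    Returns None when either tensor carries no quantization parameters.
    """
    if input_type.quant is None or weight_type.quant is None:
        return None
    return ConvQuantInfo(input_zp=input_type.quant.zero_point,
                         weight_zp=weight_type.quant.zero_point)


# Usage: after this call the op itself carries what code generation needs.
activation = TensorType((1, 8, 8, 3), QuantParams(scale=0.02, zero_point=-128))
weights = TensorType((16, 3, 3, 3), QuantParams(scale=0.005, zero_point=0))
quant_info = build_conv_quant_info(activation, weights)
```

Note that the tensor types are left unchanged by the helper, so the
tensor-carried parameters remain available to later passes.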

This enables the tensors to be subsequently interpreted simply as contiguous
buffers containing raw data, with no 'meta information' in the form of the
quantization_type. Precision-related manipulation of the input or output is
instead described by the operator itself, which specifies, for example, when
the zero point is applied or when the scale multiplication is done.
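
To make the idea of an operator describing its own precision handling
concrete, here is a simplified Python sketch. It treats the inputs as raw
integers, subtracts op-carried zero points before accumulating, and then
rescales with an integer multiplier and shift. The function names, the
rounding choice and the int8 clamp range are assumptions for illustration
only; the authoritative behavior is the pseudocode in the TOSA specification.

```python
from typing import Sequence


def quantized_dot(inputs: Sequence[int], weights: Sequence[int],
                  input_zp: int, weight_zp: int) -> int:
    """Accumulate a dot product over raw integer buffers.

    The zero points come from the operator, not from the buffers, and are
    subtracted before each multiplication.
    """
    acc = 0
    for x, w in zip(inputs, weights):
        acc += (x - input_zp) * (w - weight_zp)
    return acc


def rescale(acc: int, multiplier: int, shift: int, output_zp: int,
            qmin: int = -128, qmax: int = 127) -> int:
    """Apply the scale as an integer multiply and right shift (shift >= 1).

    Adding half of the divisor before shifting rounds to nearest; the output
    zero point is added last and the result is clamped to the output range.
    """
    rounded = (acc * multiplier + (1 << (shift - 1))) >> shift
    return max(qmin, min(qmax, rounded + output_zp))


# Usage: raw data in, zero points and scale supplied by the operator.
accumulator = quantized_dot([10, 12], [3, 5], input_zp=8, weight_zp=1)
# multiplier / 2**shift == 0.5 in this example.
output = rescale(accumulator, multiplier=1 << 14, shift=15, output_zp=-3)
```

Here the buffers carry no quantization metadata at all; every
precision-related step is visible in the two functions that stand in for
the operators.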

However, TOSA does *not* eliminate the existing MLIR QuantOps quantization
type information within the tensors; this leaves the choice of how to handle
quantization information to later backend code generation steps.

Maintaining the ability to overlap these different representations of
quantization parameters (i.e. tensor-carried vs op-carried) is an important
capability when considering progressive lowering between uses that expect one
scheme vs the other.

## Operation definitions

[include "Dialects/TosaOps.md"]