# MLIR Quantization

This document outlines the design of the MLIR quantization system. While the
term "quantization" is highly overloaded, in this case it refers to a fairly
narrow scope of techniques in use to enable conversion of floating-point
computations to corresponding and plausible variants expressed in integer math
for inference, as has historically been supported by low-bit depth inference
engines such as TFLite, various accelerator hardware, and many DSPs.

Much of this is inspired by the approach taken
[in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
adaptations folded in. It specifically documents the positions that MLIR has
taken on the topic, and is not a general reference.

[TOC]

## Uniform quantization

The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on
the [Real](https://en.wikipedia.org/wiki/Real_number) number line.

Further, the scheme can be applied:

*   *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each
    index along a specific axis of a tensor type.

### Fixed point values

[Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are
a [Real](https://en.wikipedia.org/wiki/Real_number) number divided by a
*scale*. We will call the result of this division the *scaled value*.

$$ real\_value = scaled\_value * scale $$

The scale can be interpreted as the distance, in real units, between
neighboring scaled values. For example, if the scale is $$ \pi $$, then fixed
point values with this scale can only represent multiples of $$ \pi $$, and
nothing in between. The maximum rounding error to convert an arbitrary Real to
a fixed point value with a given $$ scale $$ is $$ \frac{scale}{2} $$.
Continuing the previous example, when $$ scale = \pi $$, the maximum rounding
error will be $$ \frac{\pi}{2} $$.

Multiplication can be performed on scaled values with different scales, using
the same algorithm as multiplication of real values (note that the product's
scaled value has
$$ scale_{product} = scale_{left \mbox{ } operand} * scale_{right \mbox{ } operand} $$).
Addition can be performed on scaled values, so long as they have the same
scale, using the same algorithm as addition of real values. This makes it
convenient to represent scaled values on a computer as signed integers, and
perform arithmetic on those signed integers, because the results will be
correct scaled values.

### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point* to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Alternatively (and equivalently), subtracting a zero point from an affine value
results in a scaled value:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Essentially, affine values are a shift of the scaled values by some constant
amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
cannot, in general, be directly performed on affine values; they must first be
[converted](#affine-to-fixed-point) to the equivalent scaled values.

As alluded to above, the motivation for using affine values is to more
efficiently represent real values that will actually be encountered during
computation. Frequently, real values that will be encountered are not symmetric
around the real zero. We also make the assumption that the real zero is
encountered during computation, and should thus be represented.
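As a concrete illustration (a minimal sketch; the range and variable names are
hypothetical, not part of the dialect), consider representing reals in
$$ [-1.0, 6.0] $$ with 8-bit unsigned affine values:

```python
# Hypothetical illustration of affine quantization parameters for real values
# in [-1.0, 6.0] stored as uint8. Plain arithmetic, not dialect code.

real_min, real_max = -1.0, 6.0
num_steps = 2**8 - 1  # 255 intervals between the 256 uint8 codes

# Scale: distance, in real units, between neighboring representable values.
scale = (real_max - real_min) / num_steps  # ~0.02745

# Zero point: the affine value that represents real 0. Rounding it to an
# integer slightly "nudges" the covered range so that real 0 is exactly
# representable.
zero_point = round(-real_min / scale)  # 36
assert 0 <= zero_point <= 255

# All 256 codes land in (approximately) [-1.0, 6.0]; none are wasted on
# values that never occur.
print((0 - zero_point) * scale, (255 - zero_point) * scale)  # ~-0.99 ~6.01
```

The paragraphs below make both of these points (unused bit patterns and exact
representability of zero) precise.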
In this case, it is inefficient to store scaled values as signed integers,
because some of the signed integers will never be used; in effect, the bit
patterns corresponding to those signed integers go to waste.

In order to exactly represent the real zero with an integral-valued affine
value, the zero point must be an integer between the minimum and maximum affine
value (inclusive). For example, given an affine value represented by an 8-bit
unsigned integer, we have: $$ 0 \leq zero\_point \leq 255 $$. This is
important, because in convolution-like operations of deep neural networks, we
frequently need to zero-pad inputs and outputs, so zero must be exactly
representable, or the result will be biased.

### Relation

Real values, fixed point values, and affine values relate through the following
equation, which demonstrates how to convert one type of number to another:

$$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$

Note that computers generally store mathematical values using a finite number
of bits. Thus, while the above conversions are exact, to store the result in a
finite number of bits, we must, in general, round the result of the conversion
(this applies to both cases: storing using floating point and storing using
fixed point). Note that a full discussion of rounding behavior is outside the
scope of this document, and it is safe to assume unless otherwise stated that
rounding should be according to the IEEE 754 default of RNE (round to nearest,
ties to even), where hardware permits.

### Converting between real and fixed point or affine

To convert a real value to a fixed point value, we must know the scale. To
convert a real value to an affine value, we must know the scale and the zero
point.

#### Real to affine

To convert an input tensor of real-valued elements (usually represented by a
floating point format, frequently
[Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
to a tensor of affine elements represented by an integral type (e.g. 8-bit
unsigned integer), the following conversion can be performed (note that it is
not required that all representable values of the integral type are used):

$$
\begin{align*}
affine\_value_{uint8 \, or \, uint16}
  &= clampToTargetSize(roundToNearestInteger(\frac{real\_value_{Single}}{scale_{Single}})_{sint32} + zero\_point_{uint8 \, or \, uint16})
\end{align*}
$$

In the above, we assume that $$ real\_value $$ is a Single, $$ scale $$ is a
Single, $$ roundToNearestInteger $$ returns a signed 32-bit integer, and
$$ zero\_point $$ is an unsigned 8-bit or 16-bit integer. Note that the bit
depths shown are indicative of common types on typical hardware; the scheme is
not constrained to particular bit depths, nor does it require that the entire
range of an N-bit integer be used.
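Transcribed directly into code (a minimal sketch; the function and parameter
names are illustrative, not part of any MLIR API), the conversion looks like:

```python
import numpy as np

def quantize(real_values, scale, zero_point, qmin=0, qmax=255):
    """Real -> affine, transcribing the formula above.

    A minimal sketch assuming float32 inputs and a uint8-style target range.
    """
    x = np.asarray(real_values, dtype=np.float32)
    # roundToNearestInteger(real / scale), held in a wide signed type.
    scaled = np.rint(x / np.float32(scale)).astype(np.int32)
    # Add the zero point, then clampToTargetSize.
    return np.clip(scaled + np.int32(zero_point), qmin, qmax).astype(np.uint8)

# With scale = 7 / 255 and zero_point = 36 from the earlier example:
# quantize([-1.0, 0.0, 6.0], 7 / 255, 36) -> [0, 36, 255]
```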
#### Affine to real

To convert an output tensor of affine elements represented by uint8
or uint16 to a tensor of real-valued elements (usually represented with a
floating point format, frequently Single precision), the following conversion
can be performed:

$$
\begin{align*}
real\_value_{Single}
  &= roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single}
\end{align*}
$$

In the above, we assume that the result of subtraction is in 32-bit signed
integer format, and that $$ roundToNearestFloat $$ returns a Single.

#### Affine to fixed point

When the affine and fixed point scales are the same, subtract the zero point
from the affine value to get the equivalent fixed point value.

$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$

#### Fixed point to affine

When the affine and fixed point scales are the same, add the zero point to the
fixed point value to get the equivalent affine value.

$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$
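The inverse conversion admits an equally direct transcription (again a sketch
with illustrative names; note that the zero-point subtraction in the middle is
exactly the affine-to-fixed-point step above):

```python
import numpy as np

def dequantize(affine_values, scale, zero_point):
    """Affine -> real, transcribing the "Affine to real" formula above."""
    # Subtracting the zero point in sint32 is the affine-to-fixed-point step;
    # the result is a scaled value.
    scaled = np.asarray(affine_values).astype(np.int32) - np.int32(zero_point)
    # roundToNearestFloat(scaled), then multiply by the scale.
    return scaled.astype(np.float32) * np.float32(scale)

# Round-tripping the earlier example recovers each real to within scale/2:
# dequantize([0, 36, 255], 7 / 255, 36) -> [~-0.99, 0.0, ~6.01]
```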
## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:

*   *Quantization* dialect containing:

    *   A family of [QuantizedTypes](#quantized-type) which represent the
        mapping between *expressed* values (typically of a floating point
        computer type) and *storage* values (typically of an integral computer
        type).
    *   [Type conversion ops](#quantized-type-conversion-operations) for
        converting between types based on a QuantizedType and its *expressed*
        and *storage* sub-types.
    *   [Instrumentation ops](#instrumentation-and-constraint-operations) for
        assigning instrumentation points within the computation where runtime
        statistics may help guide the quantization process.

*   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)

*   [TFLite native quantization](#tflite-native-quantization)

    *   The TFLite op-set natively supports uniform-quantized variants.
    *   Passes and tools exist to convert directly from the *TensorFlow*
        dialect to the TFLite quantized operation set.

*   [*FxpMath* dialect](#fxpmath-dialect) containing (experimental) generalized
    representations of fixed-point math operations and conversions:

    *   [Real math ops](#real-math-operations) representing common combinations
        of arithmetic operations that closely match corresponding fixed-point
        math concepts (as opposed to being spread across multiple ops as is
        typical in source dialects).
    *   [Fixed-point math ops](#fixed-point-math-operations) for carrying out
        computations on integers, as are typically needed by uniform
        quantization schemes.
    *   Passes to lower from real math operations to fixed-point math
        operations.

*   [Solver tools](#solver-tools) which can (experimentally and generically)
    operate on computations expressed in the *FxpMath* dialect in order to
    convert from floating point types to appropriate *QuantizedTypes*, allowing
    the computation to be further lowered to integral math operations.

Not every application of quantization will use all of these facilities.
Specifically, the TensorFlow to TensorFlow Lite conversion uses the
QuantizedTypes but has its own operations for type conversion and expression of
the supporting math.

## Quantization Dialect

### Quantized type

TODO: Flesh this section out.

*   QuantizedType base class
*   UniformQuantizedType

### Quantized type conversion operations

*   qcast : Convert from an expressed type to QuantizedType
*   dcast : Convert from a QuantizedType to its expressed type
*   scast : Convert between a QuantizedType and its storage type

### Instrumentation and constraint operations

*   const_fake_quant : Emulates the logic of the historic TensorFlow
    fake_quant_with_min_max_args operation.
*   stats_ref : Declares that statistics should be gathered at this point with
    a unique key and made available to future passes of the solver.
*   stats : Declares inline statistics (per layer and per axis) for the point
    in the computation. stats_ref ops are generally converted to stats ops once
    trial runs have been performed.
*   coupled_ref : Declares points in the computation to be coupled from a type
    inference perspective based on a unique key.

## Integration with simulated quantization at training time

TensorFlow has historically used the
[tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
family of operations to simulate the effect of quantization at training time.

As originally implemented, TensorFlow Lite was the primary user of such
operations at inference time. When quantized inference was enabled, if every
eligible tensor passed through an appropriate fake_quant node (the rules of
which tensors can have fake_quant applied are somewhat involved), then
TensorFlow Lite would use the attributes of the fake_quant operations to make a
judgment about how to convert to use kernels from its quantized operations
subset.

In MLIR-based quantization, fake_quant_\* operations are handled by converting
them to a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with
an appropriate *UniformQuantizedType* as the target of the qcast operation (a
sketch of this decomposition follows at the end of this section).

This allows subsequent compiler passes to preserve the knowledge that
quantization was simulated in a certain way, while giving the compiler
flexibility to move the casts as it simplifies the computation and converts it
to a form based on integral arithmetic.

This scheme also naturally allows computations that are *partially quantized*,
where the parts which could not be reduced to integral operations are still
carried out in floating point with appropriate conversions at the boundaries.
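A minimal sketch of that qcast/dcast decomposition (illustrative only; the
helper name and the NumPy formulation are assumptions, not dialect APIs):

```python
import numpy as np

def fake_quant(real_values, scale, zero_point, qmin=0, qmax=255):
    """Simulated quantization as qcast (quantize) followed by dcast
    (dequantize). A sketch, not the dialect implementation."""
    x = np.asarray(real_values, dtype=np.float32)
    # qcast: real -> affine, per the "Real to affine" formula.
    affine = np.clip(np.rint(x / scale).astype(np.int32) + zero_point,
                     qmin, qmax)
    # dcast: affine -> real, reintroducing the rounding/clamping error.
    return (affine - zero_point).astype(np.float32) * np.float32(scale)

# fake_quant([0.1, 3.14], 7 / 255, 36) -> [~0.110, ~3.129]: each input is
# snapped to the nearest representable real, exactly as quantized inference
# would see it.
```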
## TFLite native quantization

TODO: Flesh this out

### General algorithm

1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ ->
    tfl.Q -> tfl.DQ).
1.  Hardcoded logic/propagation needs to happen here.
1.  Run TF constant folding.
1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run a quantization pass that takes (tfl.DQ (for both input and weights) ->
    op -> tfl.Q) and replaces it with (op). Also replace (constant_float ->
    tfl.Q) with (constant_quant).

## FxpMath dialect

### Real math operations

Note that these all support explicit clamps, which allows for simple fusions
and representation of some common sequences of quantization-compatible math.
In addition, some support explicit biases, which are often represented as
separate adds in source dialects.

TODO: This operation set is still evolving and needs to be completed.

*   RealBinaryOp
    *   RealAddEwOp
    *   RealSubEwOp
    *   RealMulEwOp
    *   RealDivEwOp
*   RealUnaryOp
    *   IDENTITY
    *   TANH
    *   SIGMOID
    *   EXP
    *   LOG
    *   NEG
    *   RSQRT
    *   SIN
    *   SQUARE
    *   SQRT
    *   CMPZ
    *   CMPNZ
    *   CMPLZ
    *   CMPGZ

### Fixed-point math operations

TODO: This operation set only has enough operations to lower a simple
power-of-two RealAddEwOp.

*   RoundingDivideByPotFxpOp
*   SaturatingAddFxpOp

## Solver tools

Solver tools exist to analyze an MLIR computation, expressed in either a
supported source dialect or in the *real math ops* set, and solve for
appropriate QuantizedTypes that allow the computation to be lowered to
integral math.

These tools are an active area of work and may be expanded in the future to
adjacent areas such as solving for transformations to other kinds of lower
precision types (e.g. bfloat16 or fp16).

Solver tools are expected to operate in several modes, depending on the
computation and the training characteristics of the model:

*   *Transform* : With all available information in the MLIR computation, infer
    boundaries where the computation can be carried out with integral math and
    change types accordingly to appropriate QuantizedTypes:

    *   For passthrough ops which do not perform active math, change them to
        operate directly on the storage type, converting in and out at the
        edges via scast operations.
    *   For operations that have the *Quantizable* trait, the type can be set
        directly. This includes operations from the
        [real math ops set](#real-math-operations).
    *   For others, encase them in appropriate dcast/qcast operations,
        presuming that some follow-on pass will know what to do with them.

*   *Instrument* : Most of the time, there are not sufficient implied
    constraints within a computation to perform many transformations. For this
    reason, the solver can insert instrumentation operations at points where
    additional runtime statistics may yield solutions. It is expected that such
    computations will be lowered as-is for execution, run over an appropriate
    evaluation set, and statistics at each instrumentation point made available
    for a future invocation of the solver.

*   *Simplify* : A variety of passes and simplifications are applied once
    QuantizedTypes are added in order to arrive at a computation that is
    expressed in as much integral math as possible, with the fewest casts.
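To make the lowering target concrete, the following sketch shows an addition
carried out entirely in integer math, loosely mirroring the SaturatingAddFxpOp
and RoundingDivideByPotFxpOp operations listed above, under the assumption
that all scales are powers of two (the helper names and types are
illustrative, not dialect semantics):

```python
import numpy as np

def saturating_add(a, b, bits=16):
    """Add two int32 scaled values, clamping to an N-bit signed range."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.int64(a) + np.int64(b), lo, hi).astype(np.int32)

def rounding_divide_by_pot(x, exponent):
    """Divide by 2**exponent (exponent > 0), rounding to nearest with ties
    rounded up."""
    return (x + (1 << (exponent - 1))) >> exponent

# Add two values with scale 2**-10 and rescale the sum to an output scale of
# 2**-8 (a rounding divide by 2**2):
a = np.int32(round(1.25 * 2**10))  # 1.25 represented at scale 2**-10
b = np.int32(round(0.50 * 2**10))  # 0.50 represented at scale 2**-10
sum_q10 = saturating_add(a, b)
sum_q8 = rounding_divide_by_pot(sum_q10, 2)
print(sum_q8 * 2.0**-8)  # -> 1.75; no floating point is used in the math
```

Non-power-of-two scales would additionally require a fixed-point multiply,
which is the kind of operation the TODO above anticipates adding.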